Job offers

Managing other people's competencies obliges us to keep raising the quality of our own services. Challenges are our passion.

2024-07-08

SITE RELIABILITY ENGINEERING MANAGER

Current
Full time, Remote
Published 2 weeks ago

   

We are looking for a Site Reliability Engineering (SRE) Manager to join our team in digital healthcare.

We are seeking a dedicated individual for a role that centers around Provet Cloud, our cloud-based veterinary practice management software.

The SRE Manager at our company plays a critical role in ensuring our systems’ reliability, performance, and scalability. The primary purpose of the role is to bridge the gap between development and operations by applying software engineering principles to infrastructure and operational challenges. The role also includes mentoring the team, planning the automation of our infrastructure to accommodate higher loads from increased usage, and monitoring cloud hosting costs to keep them at an appropriate level as our user base expands. The SRE team’s focus on automation, monitoring, and proactive maintenance helps us meet the demands of our expanding user base while ensuring that our services remain consistently available and performant.

Responsibilities:

  • lead, mentor, and support the SRE team members
  • oversee the monitoring, alerting, and troubleshooting of system issues
  • ensure high availability and reliability of production systems and services
  • coordinate response to system incidents and outages
  • perform post-incident reviews and ensure effective incident resolution and follow-up actions
  • manage and optimize the infrastructure, ensuring it meets current and future requirements
  • identify opportunities for automation to improve system reliability and operational efficiency
  • work closely with development, operations, and product teams to integrate reliability into the software development lifecycle
  • communicate effectively with stakeholders about system performance, incidents, and project status
  • define and track key performance indicators (KPIs) to measure system reliability and team performance
  • ensure systems adhere to security policies and compliance requirements

Requirements:

  • proficiency in AWS, Azure, or Google Cloud, and infrastructure as code (IaC) tools like Terraform
  • experience with monitoring tools like Prometheus or Grafana for real-time monitoring and alerting
  • experience in managing and responding to system incidents and outages
  • proven experience leading and managing an SRE or DevOps team
  • ability to prioritize tasks and manage multiple projects simultaneously
  • experience in planning and executing projects, including resource management and timeline adherence
  • experience working closely with cross-functional teams, including development, operations, and product teams
  • having one or more of these skills will help you succeed in this role

Success factors and key challenges of the role:

  • maintaining high availability while simultaneously optimizing costs is crucial for the SRE Manager role. This involves balancing the need for reliability with cost-effectiveness to ensure efficient operations
  • keeping infrastructure maintained and updated with minimal downtime is essential, ideally with no noticeable interruptions for our clients and users. This requires careful planning and execution to minimize disruptions while making necessary changes
  • effective resource planning in a rapidly changing environment is critical to avoid overprovisioning while still meeting increasing demands. This involves staying proactive and adaptable to ensure resources are utilized optimally
  • continuous review and improvement of disaster recovery plans and procedures are necessary to mitigate potential risks effectively. Regular testing and updates are vital to ensure readiness for any unforeseen events
  • quick analysis and mitigation of any issues or incidents is essential, along with a clear plan for permanent resolution. This includes identifying root causes and implementing corrective measures to prevent recurrence

Nice to have:

  • focus on automating processes to improve efficiency and reduce manual intervention
  • some experience working in a fast-growing, global SaaS company
  • ability to use data and metrics to drive decisions and improvements
  • understanding of security best practices and compliance requirements
  • experience in performance tuning and capacity planning

What we offer:

  • the chance to work in a meaningful industry and in a fast-growing, global company on a path to changing digital healthcare
  • competitive compensation and benefits
  • learning and professional growth opportunities
  • the tools you need and enjoy using

Employment is directly with the client

Employment based on a B2B contract

Work in an international environment

Fully remote

Feedback is provided within a few days of sending your CV

Job offer features

Position: JOBS
