Sr Eng Manager, SRE & Observability

November 12 2024
Industries Food, Catering, Beverage
Categories Data Centre, Warehousing, Cloud
Toronto, ON • Full time

Position Title: Sr Eng Manager, SRE & Observability

Position Type: Regular - Full-Time

Position Location: Toronto HQ

Requisition ID: 31044

JOB PURPOSE:

Reporting to the Director, Infrastructure Operations, the Sr Engineering Manager, SRE & Observability will design, implement and monitor enterprise-grade, secure, fault-tolerant SRE and Observability infrastructure.

The Senior Manager is an engineering leader who will lead members of the engineering staff working across the organization to provide a frictionless experience to our customers and maintain the highest standards of reliability and availability. Our team thrives and succeeds in delivering high-quality technology products and services in a hyper-growth environment where priorities shift quickly. The ideal candidate has the broad and deep technical knowledge and experience to improve application performance, benchmark capacity, improve availability, security and reliability, design and evolve cloud/infrastructure architecture, and leverage engineering solutions to solve operational problems. The candidate should also have deep technical expertise in software engineering, Kubernetes, Metrics, Logs, Traces, Synthetics, Digital Experience Monitoring, DevOps, Big Data processing, and the open-source Observability platform domain.

JOB RESPONSIBILITIES:

  • Develop and implement an Observability and SRE strategy
  • Collaborate with the Infrastructure, applications and Data teams to understand their pain points around monitoring, performance, efficiency, reliability, availability, and formulate strategies to address recurring issues in a sustainable way.
  • Influence and build vision with application owners to ship quality products at a faster pace.
  • Ownership of the end-to-end delivery of team strategy and execution
  • Develop and motivate teams to solve complex problems and be a strong advocate for open-source technologies and solutions.
  • Be technically hands-on in coding as well as building highly available systems.
  • Be responsible for building and mentoring a new team of software engineers
  • Drive the team to build solutions that serve long-term goals while ensuring that high-priority tech debt is resolved efficiently.
  • Be a strong thought leader in Site Reliability engineering, Observability, Operational excellence, Big Data processing, and DevOps Principles.
  • Consistently share best practices and improve processes within and across teams.
  • Hands-on Software engineering manager with strong understanding of Site Reliability Engineering, Big Data processing, Observability and DevOps principles.
  • Fluency with at least one modern language such as Python, Java, or Go; experience with open-source software is a big plus.
  • Hands-on experience in managing infrastructure components through Infrastructure as Code using Terraform, Ansible
  • Strong technical acumen in Cloud Architecture, Observability, Performance Benchmarking, Capacity planning and Reliability tools.
  • Expert in Container orchestration (e.g., Kubernetes), container runtimes and OS (Operating System) optimization.
  • Experience in Observability platforms, application monitoring tools and performance analysis techniques.
  • Experience managing & growing technical leaders and teams.
  • In-depth knowledge of data structures and algorithms.
  • Expert in Open-source observability software like Grafana, Prometheus, and OTEL
  • Knowledge in ML and AI technologies
  • Develop and improve instrumentation for monitoring and logging the health and availability of services.
  • Proactively monitor systems, networks, and applications to provide input in improving the stability, security, efficiency, and scalability of systems.
  • Develop and maintain Monitoring and Logging Frameworks for all of ITX
  • Take personal responsibility for the quality, reliability and availability of global IT corporate infrastructure.
  • Own operations documentation of monitoring and logging for global IT production infrastructure.
  • Participate in rotating on-call incident response on the weekdays and on the weekends.
  • Improve operational efficiencies via scripting, bots and integrations.
  • Participate cross functionally with vendors and other IT engineering teams to ensure smooth service delivery.
  • Network and systems troubleshooting, fault analysis, and resolution.
  • Collaborate with Incident and Problem Management to reduce MTTR and Incident volume.
  • Design, implement, and maintain AIOps solutions to monitor and analyze IT systems, applications, and networks.
  • Deploy machine learning algorithms for anomaly detection, root cause analysis, and incident prediction.
  • Configure and manage observability tools and platforms to gain real-time visibility into system health and performance.
  • Develop monitoring dashboards, alerts, and reports to provide comprehensive insights into the IT environment.
  • Conduct root cause analysis for incidents using data from AIOps and observability tools to identify underlying issues.
  • Work closely with software engineers to instrument applications with appropriate logging, metrics, and tracing capabilities
  • Continuously analyze monitoring data to identify trends, anomalies, and opportunities for optimization.
  • Stay updated with industry trends and advancements in AIOps and observability practices, and recommend new tools or methodologies for adoption
  • Designing, developing, and implementing AI models and algorithms utilizing state-of-the-art techniques such as GPT, VAE, and GANs.
  • Collaborating with cross-functional teams to define AI project requirements and objectives, ensuring alignment with overall business goals.
  • Conducting research to stay up-to-date with the latest advancements in generative AI, machine learning, and deep learning techniques and identify opportunities to integrate them into our products and services.
  • Optimizing existing generative AI models for improved performance, scalability, and efficiency.
  • Developing and maintaining AI pipelines, including data preprocessing, feature extraction, model training, and evaluation.
  • Developing clear and concise documentation, including technical specifications, user guides, and presentations, to communicate complex AI concepts to both technical and non-technical stakeholders.
  • Contributing to the establishment of best practices and standards for generative AI development within the organization.
  • Providing technical mentorship and guidance to junior team members.
  • Apply trusted AI practices to ensure fairness, transparency, and accountability in AI models and systems
  • Drive DevOps and MLOps practices, covering continuous integration, deployment, and monitoring of AI
  • Utilize tools such as Docker, Kubernetes, and Git to build and manage AI pipelines
  • Implement monitoring and logging tools to ensure AI model performance and reliability
  • Collaborate seamlessly with software engineering and operations teams for efficient AI model integration and deployment.
  • Familiarity with DevOps and MLOps practices, including continuous integration, deployment, and monitoring of AI models.

KEY QUALIFICATION & EXPERIENCES:

  • Minimum 10 years of experience in Observability/Monitoring tools
  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field.
  • 5+ years of industry experience in software development.
  • In-depth experience designing at scale monitoring and logging for corporate infrastructure services.
  • Expert-level experience in monitoring and logging technologies, both open source and closed source (e.g. AppDynamics, New Relic, Datadog, Prometheus, Grafana, LogicMonitor, Sumo Logic, ELK)
  • Experience in implementing Metrics, Logs and Tracing for E2E observability
  • Experience in RBAC and user based security services such as ISE, Radius, LDAP, and AD.
  • Must have strong automation/scripting skills - proficiency in Python or Golang is a plus.
  • Proficient in developing and maintaining technical documentation, runbooks, and procedures.
  • A working knowledge of networking is needed. Fundamental knowledge of the TCP/IP stack, application protocols (DHCP/DNS/HTTPS) and networking concepts (HSRP/NAT/VPN/VLANs/802.1x/Wireless/Clustering/High Availability/Load Balancing).
  • Understanding of enterprise networks using Cisco IOS/NX-OS with a working knowledge of IP Protocols (TCP/UDP/ICMP) and Routing Protocols (BGP/OSPF/IS-IS).
  • Technology understanding of Cisco, Cloud Native Firewalls, including Firewall Policy Rules, URL-Filtering, App-ID, User-ID, etc.
  • Experience interacting with Telco and Global ISPs (WAN/DIA) and the monitoring of those services.
  • A working knowledge of systems is needed. Fundamental knowledge of Configuration Management and Automation tools, with experience in:
    * Terraform, Ansible, Chef, Puppet, Jenkins
    * Designing and implementing CI/CD pipelines
    * Infrastructure provisioning and management
  • Strong skills in troubleshooting incidents in production environments.
  • A strong ownership attitude and a track record of taking responsibility for problems and pushing through to resolution.
  • Ability to communicate and coordinate with cross-functional engineering teams across multiple geographic regions.
  • Experience with AIOps and machine learning is highly desirable.
  • Knowledge of OpenTelemetry is an added advantage.
  • Experience with other monitoring tools like Prometheus, Grafana, etc.
  • Experience with Observability solutions like Dynatrace, Datadog, Instana, etc. is highly desirable.
  • Experience working with mainframe systems is a plus (willingness to learn is also acceptable).
  • Excellent problem-solving and analytical skills.
  • Strong communication and collaboration skills.
  • Ability to work independently and manage multiple projects simultaneously.
  • Passion for learning new technologies and continuous improvement.
  • In-depth knowledge of machine learning, deep learning, and generative AI techniques
  • Knowledge and experience in Generative AI
  • Proficiency in programming languages such as Python, R, and frameworks like TensorFlow or PyTorch
  • Strong understanding of NLP techniques and frameworks such as BERT, GPT, or Transformer models
  • Familiarity with computer vision techniques for image recognition, object detection, or image generation
  • Experience with cloud platforms such as Azure or AWS
  • Knowledge of IT operations concepts and processes, such as monitoring, incident management, root cause analysis, remediation.
  • Strong problem solving and analytical skills.
  • Strong interpersonal and written and verbal communication skills.
  • Highly adaptable to changing circumstances. Interest in continuously learning new skills and technologies.
  • Experience with programming and scripting languages (e.g. Java, C#, C++, Python, Bash, PowerShell).
  • Experience with incident and response management.

Qualifications

  • Bachelor's degree (or equivalent years of experience).
  • 5+ years of relevant work experience. SRE experience required.
  • Background in Manufacturing or Platform/Tech companies is preferred.
  • Must have Public Cloud provider certifications (Azure, GCP or AWS)
  • CNCF certification is a plus.

OTHER INFORMATION

  • Travel: as required.
  • Job is primarily performed in a Hybrid office environment.

Key SRE and Observability Overview and Boundaries

Infrastructure Design: Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Non-Functional Requirements; Security standards, frameworks, and methodologies (System Security Plan - SSP, Security Risk and Compliance Review - SRCR, etc.). To assist in creation of simple, modular, extensible and functional design for the product/solution in adherence to the requirements. Evaluate trade-offs while designing across multiple components in a system based on the business requirements. Convert HLD to create detailed design for specific modules / components of a product/system. Understand nuances of designing for disaster recovery. Undertake infrastructure coding automation.

Performance and Optimization : Requires knowledge of: Unix/Linux performance optimization tuning; Java/NodeJS/Tomcat/Apache tuning and optimization; Open-source Chaos tools (for example, Openblade, Chaos Monkey, Pumba, Chaos Mesh, Litmus, Chaos Toolkit, Toxiproxy). To select appropriate reliability models to evaluate and estimate complex reliability parameters. Designs and develops a reliability program plan for a complex site environment. Facilitates reliability testing procedures. Ensures reliability testing procedures align with site environment changes.

Integrates the business goals of site reliability engineering and site safety engineering. Trains team members on the development and implementation of tools and applications for reliability predictions and improvements. Decides criteria selection and evaluation for site reliability analysis and assessment. Facilitates open-source Chaos experiments to test and validate the resiliency of applications.

Solution Design : Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Minimum Viable Product- MVP; Non-Functional Requirements; Telemetry To create simple, modular, extensible and functional design in adherence to the requirements for multiple products/solutions within a domain. Understand Customer requirements and analyze the gaps between existing architecture and customer requirements. Analyze system performance impacting the complete product for non-functional requirements like reliability, operability, performance efficiency and security. Create detailed design using mock screens, pseudo codes and detailed functional logic of the modules for an entire product. Finalize the tech stack (For example MEAN, LAMP etc.) - for products/systems based on the business needs. Review the MVP to uncover risks and check for performance and usability; guide the team during MVP creation. Drive design of software, production and preproduction environments and deployment pipeline to continuously generate records for telemetry.

Coding : Requires knowledge of: Coding standards and guidelines; Coding languages (E.g. JavaScript, Python, C# etc.), frameworks (E.g. ActiveX, .Net, Cocoa, Android application framework etc.), tools (E.g. Monday.com, Linx, Embold etc.) and Platforms (E.g. Microsoft Azure, AWS, Apple iOS etc.); Quality, Safety and Security (PCI etc.) standards; Emerging tools and technologies; Telemetry. To create/configure minimalistic code for entire component/application and ensure the components are meeting business/technical requirements, non-functional requirements, low-maintenance, high-availability and high-scalability needs. Assist in the selection of appropriate languages (E.g. JavaScript, Python, C# etc.), development standards and tools (E.g. Monday.com, Linx, Embold etc.) for software coding/configuration. Take initiative to learn the fundamentals of different coding languages and frameworks that would be useful for future scope of work. Build scripts for automation of repetitive and routine tasks in CI/CD (Continuous Integration/Continuous Delivery), Testing or any other process (as applicable). Implement telemetry features as required independently. Ensure security policy requirements are properly applied to components/application during code development/configuration.

Triaging and Troubleshooting : Possesses knowledge of: Regression testing; Root cause analysis (RCA); Root cause corrective action (RCCA) To analyze defects from past projects/solutions to avoid recurrence. Troubleshoots performance and availability bottlenecks for assigned application independently. Triages to detect and determine symptom versus cause of defects. Actively provides data for and participates in RCA.

Disaster Recovery Planning : Requires knowledge of: Disaster recovery procedures and processes; Enterprise disaster recovery systems. To work with business partners to identify and document critical applications. Interprets and follows procedures in contingency plans. Explains the contingency and disaster recovery plans for assigned environment. Executes established procedures necessary to continue operations in an emergency. Participates in the design of a minimum operating environment for a computer-based facility.

Monitoring and Alerting : Requires knowledge of: Monitoring and alerting tools; Monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); Distributed tracing; Alerting logic. To suggest metrics to monitor software or system performance. Monitors current performance data to ensure compliance with defined SLOs for multiple applications/systems. Determines thresholds for monitoring metrics and triggers alerts based on thresholds. Supervises specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software. Makes recommendations regarding situational awareness and alerting. Makes recommendations regarding instrumentation gaps and alerting logic.

Drives the execution of multiple business plans and projects by identifying customer and operational needs; developing and communicating business plans and priorities; removing barriers and obstacles that impact performance; providing resources; identifying performance standards; measuring progress and adjusting performance accordingly; developing contingency plans; and demonstrating adaptability and supporting continuous learning.

Provides supervision and development opportunities for associates by selecting and training; mentoring; assigning duties; building a team-based work environment; establishing performance expectations and conducting regular performance evaluations; providing recognition and rewards; coaching for success and improvement; and ensuring diversity awareness.

Promotes and supports company policies, procedures, mission, values, and standards of ethics and integrity by training and providing direction to others in their use and application; ensuring compliance with them; and utilizing and supporting the Open Door Policy.

Ensures business needs are being met by evaluating the ongoing effectiveness of current plans, programs, and initiatives; consulting with business partners, managers, co-workers, or other key stakeholders; soliciting, evaluating, and applying suggestions for improving efficiency and cost effectiveness; and participating in and supporting community outreach events.

The above information indicates the general nature and level of work performed by employees within this classification. It is not a comprehensive inventory of all duties, responsibilities and qualifications required of employees assigned to this job.

McCain Foods is an equal opportunity employer. We see value in ensuring we have a diverse, antiracist, inclusive, merit-based, and equitable workplace. As a global family-owned company we are proud to reflect the diverse communities around the world in which we live and work. We recognize that diversity drives our creativity, resilience, and success and makes our business stronger.

McCain is an accessible employer. If you require an accommodation throughout the recruitment process (including alternate formats of materials or accessible meeting rooms), please let us know and we will work with you to meet your needs.

The health and safety of McCain employees and their families has been our number one priority since the start of the COVID-19 pandemic. With vaccination restrictions easing across the globe we do not currently require employees to be vaccinated, but we reserve the right to change this mandate in line with health guidance and regulations in each country.

Your privacy is important to us. By submitting personal data or information to us, you agree this will be handled in accordance with the Global Privacy Policy.

Job Family: Information Technology
Division: Global Digital Technology
Department: Infrastructure and Operations
Location(s): CA - Canada : Ontario : Toronto || US - United States of America : Illinois : Oakbrook Terrace

Company: McCain Foods (Canada)
