Position Title: Sr Eng Manager, SRE & Observability
Position Type: Regular - Full-Time
Position Location: Toronto HQ
Requisition ID: 31044
JOB PURPOSE:
Reporting to the Director, Infrastructure Operations, the Sr Engineering Manager, SRE & Observability will be responsible for designing, implementing and monitoring enterprise-grade, secure, fault-tolerant SRE and Observability infrastructure.
The Senior Manager is an engineering leader who will lead members of the engineering staff working across the organization to provide a frictionless experience to our customers and maintain the highest standards of reliability and availability. Our team thrives and succeeds in delivering high-quality technology products and services in a hyper-growth environment where priorities shift quickly. The ideal candidate has the broad and deep technical knowledge and experience to improve application performance, benchmark capacity, improve availability, security and reliability, design and evolve cloud/infrastructure architecture, and leverage engineering solutions to solve operational problems. The candidate should also have deep technical expertise in software engineering, Kubernetes, metrics, logs, traces, synthetics, Digital Experience Monitoring, DevOps, big data processing, and open-source Observability platforms.
JOB RESPONSIBILITIES:
KEY QUALIFICATION & EXPERIENCES:
Qualifications
OTHER INFORMATION
Key SRE and Observability Overview and Boundaries
Infrastructure Design: Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Non-Functional Requirements; Security standards, frameworks, and methodologies (System Security Plan - SSP, Security Risk and Compliance Review - SRCR, etc.). To assist in the creation of a simple, modular, extensible and functional design for the product/solution in adherence to the requirements. Evaluates trade-offs while designing across multiple components in a system based on the business requirements. Converts the high-level design (HLD) into a detailed design for specific modules/components of a product/system. Understands the nuances of designing for disaster recovery. Undertakes infrastructure coding and automation.
Performance and Optimization: Requires knowledge of: Unix/Linux performance tuning and optimization; Java/NodeJS/Tomcat/Apache tuning and optimization; Open-source chaos tools (for example, Openblade, Chaos Monkey, Pumba, Chaos Mesh, Litmus, Chaos Toolkit, Toxiproxy). To evaluate appropriate reliability models and estimate complex reliability parameters. Designs and develops a reliability program plan for a complex site environment. Facilitates reliability testing procedures. Ensures reliability testing procedures align with site environment changes. Integrates the business goals of site reliability engineering and site safety engineering. Trains team members on the development and implementation of tools and applications for reliability predictions and improvements. Decides criteria selection and evaluation for site reliability analysis and assessment. Facilitates open-source chaos experiments to test and validate the resiliency of applications.
Solution Design: Requires knowledge of: Software architecture; Distributed systems; Scalability; Design patterns; Disaster Recovery; Tech Stacks; Minimum Viable Product (MVP); Non-Functional Requirements; Telemetry. To create a simple, modular, extensible and functional design in adherence to the requirements for multiple products/solutions within a domain. Understands customer requirements and analyzes the gaps between the existing architecture and those requirements. Analyzes system performance across the complete product against non-functional requirements such as reliability, operability, performance efficiency and security. Creates detailed designs using mock screens, pseudocode and detailed functional logic of the modules for an entire product. Finalizes the tech stack (for example, MEAN, LAMP, etc.) for products/systems based on the business needs. Reviews the MVP to uncover risks and check for performance and usability; guides the team during MVP creation. Drives the design of software, production and preproduction environments, and the deployment pipeline to continuously generate records for telemetry.
Coding: Requires knowledge of: Coding standards and guidelines; Coding languages (e.g. JavaScript, Python, C#, etc.), frameworks (e.g. ActiveX, .NET, Cocoa, Android application framework, etc.), tools (e.g. Monday.com, Linx, Embold, etc.) and platforms (e.g. Microsoft Azure, AWS, Apple iOS, etc.); Quality, Safety and Security (PCI, etc.) standards; Emerging tools and technologies; Telemetry. To create/configure minimalistic code for an entire component/application and ensure the components meet business/technical requirements, non-functional requirements, and low-maintenance, high-availability and high-scalability needs. Assists in the selection of appropriate languages (e.g. JavaScript, Python, C#, etc.), development standards and tools (e.g. Monday.com, Linx, Embold, etc.) for software coding/configuration. Takes initiative to learn the fundamentals of different coding languages and frameworks that would be useful for the future scope of work. Builds scripts to automate repetitive and routine tasks in CI/CD (Continuous Integration/Continuous Delivery), testing or any other process (as applicable). Implements telemetry features independently as required. Ensures security policy requirements are properly applied to components/applications during code development/configuration.
Triaging and Troubleshooting: Requires knowledge of: Regression testing; Root cause analysis (RCA); Root cause corrective action (RCCA). To analyze defects from past projects/solutions to avoid recurrence. Troubleshoots performance and availability bottlenecks for assigned applications independently. Triages defects to distinguish symptom from cause. Actively provides data for and participates in RCA.
Disaster Recovery Planning: Requires knowledge of: Disaster recovery procedures and processes; Enterprise disaster recovery systems. To work with business partners to identify and document critical applications. Interprets and follows procedures in contingency plans. Explains the contingency and disaster recovery plans for the assigned environment. Executes established procedures necessary to continue operations in an emergency. Participates in the design of a minimum operating environment for a computer-based facility.
Monitoring and Alerting: Requires knowledge of: Monitoring and alerting tools; Monitoring metrics and key performance indicators (for example, availability, MTBF, MTTR); SLIs and SLOs (for example, request latency, availability, error rates, saturation); Distributed tracing; Alerting logic. To suggest metrics to monitor software or system performance. Monitors current performance data to ensure compliance with defined SLOs for multiple applications/systems. Determines thresholds for monitoring metrics and triggers alerts based on those thresholds. Supervises specific procedures to proactively check the health of applications and infrastructure, including a variety of operating systems, hardware, and software. Makes recommendations regarding situational awareness, instrumentation gaps and alerting logic.
Drives the execution of multiple business plans and projects by identifying customer and operational needs; developing and communicating business plans and priorities; removing barriers and obstacles that impact performance; providing resources; identifying performance standards; measuring progress and adjusting performance accordingly; developing contingency plans; and demonstrating adaptability and supporting continuous learning.
Provides supervision and development opportunities for associates by selecting and training; mentoring; assigning duties; building a team-based work environment; establishing performance expectations and conducting regular performance evaluations; providing recognition and rewards; coaching for success and improvement; and ensuring diversity awareness.
Promotes and supports company policies, procedures, mission, values, and standards of ethics and integrity by training and providing direction to others in their use and application; ensuring compliance with them; and utilizing and supporting the Open Door Policy.
Ensures business needs are being met by evaluating the ongoing effectiveness of current plans, programs, and initiatives; consulting with business partners, managers, co-workers, or other key stakeholders; soliciting, evaluating, and applying suggestions for improving efficiency and cost effectiveness; and participating in and supporting community outreach events.
The above information indicates the general nature and level of work performed by employees within this classification. It is not a comprehensive inventory of all duties, responsibilities and qualifications required of employees assigned to this job.
McCain Foods is an equal opportunity employer. We see value in ensuring we have a diverse, antiracist, inclusive, merit-based, and equitable workplace. As a global family-owned company we are proud to reflect the diverse communities around the world in which we live and work. We recognize that diversity drives our creativity, resilience, and success and makes our business stronger.
McCain is an accessible employer. If you require an accommodation throughout the recruitment process (including alternate formats of materials or accessible meeting rooms), please let us know and we will work with you to meet your needs.
The health and safety of McCain employees and their families has been our number one priority since the start of the COVID-19 pandemic. With vaccination restrictions easing across the globe, we do not currently require employees to be vaccinated, but we reserve the right to change this mandate in line with health guidance and regulations in each country.
Your privacy is important to us. By submitting personal data or information to us, you agree this will be handled in accordance with the Global Privacy Policy.
Job Family: Information Technology
Division: Global Digital Technology
Department: Infrastructure and Operations
Location(s): CA - Canada : Ontario : Toronto || US - United States of America : Illinois : Oakbrook Terrace
Company: McCain Foods (Canada)