Senior Manager - Product Tech & SRE

Location: Singapore
Discipline: Others
Job type: Permanent
Contact email: career@bcg-rise.com
Job ref: 308296
Published: 20 days ago

Position Summary / Project Description
As the Product Lead, you will support the Product Principal in strategy discussions with senior management on application and system improvement, and you will also manage the reliability team.

Together with the Reliability Leads, you will ensure that existing site reliability engineering (SRE) initiatives, such as monitoring availability, uplifting capability, and automation across our products, stay on track. You will also assist the Reliability and Product Principal and the application teams in reviewing the reliability program, taking stock of successes and challenges, and refining the program. Together with the Reliability Leads, you will be responsible for the management reports that describe the current situation and recommend next steps.

As Lead of the Product team, which consists of experienced product specialists, you will coach the application and service management teams to help them improve application reliability through tools, monitoring, and prevention activities. You will collaborate with the application, incident management (IOC), and infrastructure support teams to identify and implement procedures, tools, and scripts that improve reliability, reduce downtime, and increase automation.

Role and Responsibilities

  • Gather and analyze metrics from both operating systems and products to assist in performance tuning and fault finding

  • Make monitoring and alerting trigger on symptoms rather than on outages

  • Document every action so findings turn into repeatable actions, and then into automation

  • Partner with project teams to improve services through rigorous testing and release procedures

  • Participate in system design consulting, platform management, and capacity planning

  • Create sustainable systems and services through automation and uplifts

  • Debug production incidents across services and all levels of the stack

  • Continuously and proactively sharpen technical knowledge related to the product and its roadmap

Key Tasks:

  1. Strive for automation, either by coding it or by leading and influencing engineers to build systems that are easy to run in production.

  2. Identify significant projects that result in substantial cost savings.

  3. Identify changes to the production architecture from a reliability, performance, and availability perspective, using a data-driven approach.

  4. Proactively work on efficiency and capacity planning to set clear requirements and reduce system resource usage, lowering operating costs for all our customers.

  5. Identify parts of the system that do not scale, provide immediate palliative measures, and drive long-term resolution of these issues.

  6. Identify Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.

  7. Know a domain really well and radiate that knowledge through recorded demos, discussions in DNA (Design and Automation) meetings, or Incident Reviews.

  8. Perform and run blameless RCAs on incidents and outages, aggressively looking for answers that will prevent the incident from recurring.

  9. Set an example for the team of specialists through positive, inclusive leadership and discussion of work.

  10. Show ownership of a major part of the infrastructure.

  11. Be trusted to de-escalate conflicts inside the team.

Requirements / Qualifications

  • Bachelor’s degree in computer science or another highly technical or scientific discipline

  • Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C#, or JavaScript

  • Experience with infrastructure technologies such as operating systems (Windows and Linux), networking, storage, and virtualization

  • Familiarity with test automation tools

  • A drive to deliver quickly and iterate fast

  • A proactive approach to spotting problems, areas for improvement, and performance bottlenecks

  • Previous success in leading software engineering teams of more than 10 engineers

  • A track record of delivering large-scale software applications to production

  • Excellent communication skills

  • Thriving as a member of a team

  • Ability to excel under pressure

  • The ability to think fast

  • A natural problem solver

  • Advanced-level certification and at least 10 years of working experience with 2 or 3 of the following products:

    o WebLogic/WebSphere/JBoss

    o Epic

    o Allscripts Sunrise Clinical Manager

    o SAP

    o OnBase

    o Dynatrace

    o ServiceNow

    o F5