Job Description
  • As a Site Reliability Engineer-I, you will be responsible for building and automating secure cloud infrastructure (Infrastructure as Code, IaC) across several pillars: cost, reliability, scalability, performance, deployment, and service availability (SLA/SLO/SLI).

A Day in the Life

  • Build the CICD stack in collaboration with the Dev and QA/Automation teams, and drive the organization to a new level of (daily/hourly) continuous delivery and deployment.
  • Security is paramount to everything we do; you will work closely with the CISO and Dev team(s) to make security a first-class citizen. Develop S-CICD (Secure CICD), and deliver security tool chains and vulnerability reports to developers via automation.
  • Observability is critical at the scale of our systems, both for gaining insight into system behavior and for detecting problems and failures. We are looking for leads to drive this charter across logs, metrics, mesh, tracing, etc.
  • Collaborate closely with the Dev and QA teams to bring each initiative to closure and increase adoption of DevOps practices and tool chains.
  • Apply strong analytical skills to understand production system metrics, drive change, optimize system utilization and drive cost efficiency.
  • Autoscale the platform up and down for peak-season scenarios.
  • Understand the end-to-end platform architecture and how to perform triage/RCA quickly and effectively using data points from the observability tool chain.
  • You will be part of the 24x7 OnCall Production Support team.
  • Lead the monthly operations review with the executive team. Topics include, but are not limited to, platform/application/infrastructure KPIs: uptime, RCA, CAP (Corrective Action Plan) and PAP (Preventive Action Plan), security reports, and audit reports.
  • You will operate and manage the production and staging cloud platforms, covering both Ops (executing and automating runbooks/SOPs, maintaining uptime/SLA) and site reliability engineering.
  • Ensure that the platform is secured per the guidelines established by the CISO, e.g. protect against DDoS attacks by implementing a WAF, manage vulnerabilities and patches, and install required security agents.
  • Lead least-privilege RBAC for production services and tool chains.
  • Build and execute the Disaster Recovery plan.
  • Participate as a key stakeholder in IR (Incident Response).

What You Need

  • Proven work experience of 1-4 years in DevOps/SRE.
  • Solid, automation-focused experience with at least one cloud: AWS, Azure, or GCP. Certification is an advantage.
  • Hands-on experience with Kubernetes along with Linux.
  • Programming experience with scripting languages, e.g. Python.
  • Experience building scalable CICD architectures and solutions for build and deployment is preferred.
  • Experience building an observability stack spanning logs, metrics, traces, service mesh, and data observability is preferred.
  • Experience building reliable, scalable, and performant systems in production. This requires significant engineering experience and risk evaluation.
  • Good at documenting and structuring documents for consumption by various dev teams.
  • Experience working in a Production environment with process focus is preferred.
  • Experience with ticketing systems and incident management is preferred.
  • Cloud security experience is a major advantage and a highly preferred skill.
  • Hands-on experience with a few of these is preferred: Kafka, Postgres, Snowflake, etc.
  • Bachelor's degree or equivalent.

Personality Traits

  • Able to keep a cool head under pressure without taking shortcuts.
  • Collaboration and strong communication are critical to this role. Possesses excellent verbal and written communication skills and the ability to interact professionally with a diverse group of developers, product owners, and subject matter experts.
  • Strong cross-functional collaboration and relationship-building skills, and the ability to achieve results without direct reporting relationships.
  • Ability to quickly identify and drive to the optimal solution when presented with a series of constraints.
  • Excellent judgment, analytical thinking, and problem-solving skills.
  • Self-motivated individual who possesses excellent time management and organizational skills.
  • Strong sense of personal responsibility and accountability for delivering high-quality work.

Preferred Skills

  • Multi-cloud - AWS, Azure, GCP
  • Distributed Compute - Kubernetes (EKS/AKS), Containerization
  • Persistence stores - Postgres, MongoDB
  • Data Warehousing - Snowflake, Databricks
  • Messaging - Kafka
  • CICD - Jenkins, ArgoCD, GitOps
  • Observability - Elasticsearch, Prometheus, Grafana, Jaeger, New Relic, etc.

What We Offer

  • Industry-Focused Certifications: Meet leading healthcare experts, discuss innovative strategies, and become a subject matter expert with our comprehensive set of certifications.
  • Rewards and Recognition: Feeling like you're outperforming on your projects? Get recognition for your dedicated efforts and demonstrated work ethic.
  • Health Insurance and Mental Well-being: We offer health benefits and insurance to you and your family for hospital-related expenses pertaining to any illness, disease, or injury. We also have Employee Assistance Programs (EAPs) to give you 24x7 access to certified therapists and psychologists.
  • Sabbatical Leave Policy: Do you want to focus on skill development, pursue an academic career, or just reset? We've got you covered.
  • Open Floor Plan: Cubicles are a thing of the past. To modernize our office space, we have open-floor seating at every office location. Share ideas with your peers and bond better in an open-floor office where there are no barriers and you are inspired to be creative.
  • Paternity and Maternity Leave: Enjoy the industry's best parental leave policy to welcome your bundle of joy and enjoy quality time with them.

Role: Site Reliability Engineer

Industry Type: IT Services & Consulting

Department: Engineering - Software & QA

Employment Type: Full Time, Permanent

Role Category: DevOps

Education

UG: Any Graduate

PG: Any Postgraduate

Key Skills

  • Health insurance
  • Ticketing
  • Automation
  • Linux
  • Analytical
  • Relationship building
  • Disaster recovery
  • Healthcare
  • Incident management
  • Auditing

Salary

Not Disclosed

Monthly based

Location

Uttar Pradesh, India

Job Overview

Job Posted: 1 year ago

Job Type: Full Time

Job Role: Other

Education: Graduated

Experience: 2 Years

Location: Uttar Pradesh, India