Staff Site Reliability Engineer in Cincinnati, OH at MDI Group

Date Posted: 12/15/2018

Job Description

Site Reliability Engineer
Cincinnati, OH or Atlanta, GA
Contract to Hire

The Site Reliability Engineer will be responsible for the performance and availability of the Compute and Network infrastructure consumed by all business segments. The Site Reliability teams are composed of highly talented individuals obsessively focused on availability through operational excellence. The ideal individual is relentlessly technical, passionate about automating everything, and totally committed to delivering amazing customer experiences.

What you will be doing:
As a Staff Site Reliability Engineer, you must have an excellent understanding of standard IT infrastructure equipment and systems, including their reliability characteristics and failure causes. You must be able to quickly understand the key operational characteristics of new equipment and systems, interview domain experts for failure mode knowledge, and assess how possible failure modes will affect measured parameters and key performance indicators (KPIs).
Be available 24x7 to respond quickly to and resolve critical service outages that severely impact consumers.
Establish performance baselines and capacity thresholds, correlate events, and define monitoring/alerting criteria.
Develop automated solutions to address potential problems before they result in a service interruption.
Provide impact assessment and mitigation plan for changes going into the production environment.
Investigate the root cause of severe and systemic outages, identify corrective actions, and apply them across the enterprise.
Develop availability measures that align with consumer experience to accurately assess the usability of crucial services.
Build capacity models to baseline transactional load compared to resource performance and leverage data to predict overall system capacity while automating load placement to avoid outages.
Identify thresholds for all critical links in the data path to quickly isolate where imbalances may result in potential outages.
Analyze failure points in services to model risk level and resolution steps if failure occurs.
Assist in driving architecture enhancements into the system to mitigate potential failure points.
Programmatically monitor for and remediate configuration drift of critical devices.
Develop response plans for potential failure points and evaluate their effectiveness during planned tests.
Perform comprehensive operational health checks of entire services to identify areas of concern and track activities to drive improvements at all levels of the architecture.
Provide technical coaching and direction to more junior teammates.
Must Have:
Bachelor's degree in Computer Science or another STEM major.
A minimum of 6 years of professional experience in Computer Science or a related technical field.
Legal authorization to work in the U.S. is required. We will not sponsor individuals for employment visas, now or in the future, for this job.
Proficient in AWS services, with one or more certifications preferred (Solutions Architect, Developer, SysOps, DevOps, etc.).
Hands-on experience with configuration frameworks for deployment: Ansible, Terraform, Chef, Puppet, Salt, etc.
Demonstrated proficiency with scripting languages such as Python, Ruby, or similar.
Familiarity with source control systems such as Git.
Experience working with containers and container orchestration systems: ECS, Kubernetes, etc.
Experience with managing large numbers of servers in AWS and driving stability through automated monitoring, alerting, and actions.
Foundational understanding of network concepts and technologies.
Eager to learn and utilize new technologies, concepts and procedures as appropriate to project requirements.
Demonstrates awareness of competitors and industry trends, and can analyze the impact of technology choices.

