Apex Systems is looking for Reliability Engineers (RE) with a passion for system reliability to join our retail client's Reliability Engineering organization. As part of this engineering team, you will build reliability into our systems, infrastructure, and applications.
Our goals are ambitious and results-focused. They include user-facing applications, observability, production excellence, reliability, error elimination, efficiency, and the automation of manual and repetitive tasks.
The RE role provides an excellent opportunity to blend system design and software engineering skills with a passion for troubleshooting and defect elimination to address ever-changing applications and environments with scalability and reliability challenges. This is an opportunity to join the journey and have a real impact on how we support customers and build software.
The RE will work with other Reliability Engineers, Product Managers, and Developer Practitioners to deliver and ensure the highest levels of availability and reliability for all customer-facing websites, third-party interfaces, and legacy application services. The RE is expected to work with management, peers, and customers to define and implement the technical vision and to improve monitoring tools, error detection, and defect elimination while improving Mean Time to Detection/Resolution, overall service availability, and customer satisfaction.
1. Troubleshoot performance and availability issues in high-security e-commerce, infrastructure, and legacy business applications/websites, and manage the incident life cycle through to resolution.
2. Lead root cause analyses and investigations by identifying, analyzing, and remediating service performance and availability issues to ensure maximum service uptime and availability. You are expected to conduct blameless post-incident reviews.
3. Engage in and improve the whole life cycle of services, from inception and design through deployment, operation, and refinement.
4. Maintain services once they are live by measuring and monitoring availability, latency, and overall system health. You're expected to be on call, have strong written communication skills, and be able to develop working relationships with co-workers.
5. Balance service reliability, metrics, sustainability, technical debt, and operational toil for live services running at scale.
6. Work across multiple project teams simultaneously to support rapid development efforts.
7. Solve complex, critical issues that impact bottom line financial numbers and customer loyalty/experience.
8. Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
9. Contribute positively to open source projects developed by the client and join existing communities. Navigate this broader ecosystem and structure projects with upstream/downstream opportunities in mind.
10. Identify and integrate with third-party solutions where it makes the most sense.
11. Use data to understand the availability, reliability, and sustainability of software.
12. Bring experience, pragmatism, empathy, and composure to interactions with teams outside of the RE organization.
13. Work frequently with product teams on shared goals and cross-team projects.
14. Balance planned and reactive work using basic project planning techniques and technical roadmaps.
15. Work and collaborate across teams such as Application Services, Capacity Planning, Hardware, Network, and Data Center Operations.
16. Participate in building advanced tooling for testing, monitoring, administration, and operations of multiple clusters across multiple environments.
17. Negotiate SLIs, SLOs, and SLAs with product owners.
General Minimum Qualifications
- 3-5 years of experience applying reliability engineering principles to distributed services
- Understanding of and comfort with the GNU/Linux operating system
- Proficiency in high-level languages such as Ruby, Python, and Bash
- Exposure to system-level languages such as Go and C/C++
- Familiarity with configuration management software such as Puppet, Chef, Ansible, or Salt
- Source control, branching, and merging (repository management): git, svn, etc.
- Network Basics: TCP vs UDP, basic troubleshooting, HTTP - load balancing, firewall, private networks, multi-tier design, scale-out persistent data
- Databases: at a minimum, an understanding of the basics (select/insert)
- Familiarity with standard infrastructure concepts like load balancers, firewalls, and object storage, and where/when they might be used
- Service Management - Incident Response, Change, and Problem Management
- Experience with Kubernetes and Docker
- Cloud Computing Concepts (not necessarily provider specific) - VMs vs Docker Containers, block storage vs object storage, infra automation vs install automation
- Experience operating a platform, software as a service, or shipping software
- Experience as an open source contributor
- Intellectual curiosity, problem solving, and openness are key to success. Has a mindset for solving production system issues, understanding root causes through 'detective work', and automating away toil; doesn't like boring and repetitive tasks. Enjoys digging into new problems.
- Knows when to ask for help and when to dig more on their own
- Can work on different tasks in different systems week to week
- Capable of driving toward and focusing on results even when given an ill-defined problem, such as 'this is slow', by developing metrics and making measurable improvements.
General Technical Summary
- Valuable Technologies like: WebSphere Commerce, WebSphere eXtreme Scale, WebSphere Application Server, WebSphere Message Broker, WebSphere MQ, Order Management, Web Services, Tomcat, Apache, TCP, UDP, Load Balancers, Repository Management (git/svn), Puppet, Chef, Ansible, Salt, VMs, Docker Containers
- Valuable Methodologies like: ITIL, Agile, SCRUM, Reliability Engineering
- Valuable Databases/Operating Systems like: Oracle, DB2, SQL Server, Windows, UNIX, Linux, System i
- Valuable Monitoring Tools like: IBM Monitoring, SCOM, CA Spectrum, AppDynamics, Soasta, Foglight
- Valuable Service Management Tools like: Remedy, ServiceNow, Jira, Pivotal Tracker, xMatters
Apex is an Equal Employment Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, age, sexual orientation, gender identity, national origin, disability, protected veteran status, or any other characteristic protected by law. Apex will consider qualified applicants with criminal histories in a manner consistent with the requirements of applicable law. If you have visited our website in search of information on employment opportunities or to apply for a position, and you require an accommodation in using our website for a search or application, please contact our Employee Services Department at 844-463-6178