Our team is looking for computer systems engineers who:
- Are well-versed in Linux-based computing environments.
- Are curious, tenacious, and empathetic.
- Have a proven track record of mastering complex technologies.
A typical work day might be spent:
- contributing with our small team to a plan for deploying a new medical research tool in Azure,
- writing Puppet code to prevent configuration drift on the network storage devices in a GPFS cluster,
- writing Ansible playbooks to deploy development environments in VMware,
- building Docker containers that provide the runtime for a node in a complex computing workflow, or
- troubleshooting high latency in a network path connecting a sequencing platform to a storage platform.
Roughly once every several weeks you will take the on-call role for one week, responding to PagerDuty alerts that direct you to written runbooks in our Operations Manual to mitigate service outages. Very few, if any, of these alerts occur "after hours" at this time, though they may in the future. We use Google's Site Reliability Engineering book as inspiration for our operations, and we strive to measure and control "toil" so that team members share equally in the run/fix work. We track our time, and our metrics show that we spend about 50% of our time on "development" projects, 25% on run/fix, and 25% on "administration," which includes training, learning, PTO, etc.
This job will have the following responsibilities:
- Perform research and analysis and apply best practices to design and implement technological solutions in support of a defined set of research infrastructure services. Work with a team of engineers to build new and improve existing services.
- Provide operational support: respond to and resolve incidents escalated from operational teams and handle performance tuning requests, applying critical thinking skills.
- Participate in non-project efforts including HR and administrative requirements, training and skill development, and other duties as assigned.
This job will require the following qualifications:
- Associate's or Bachelor's degree plus 3 years of related experience, or an equivalent combination of education and experience.
- Experience working in scientific research.
- A relevant professional certification.
- Experience in enterprise-scale Linux environments.
- Key technologies include: Linux system administration, the Atlassian suite, Git, Bash, Perl, Python, Ruby, Go, VMware, RHEV, OpenStack, Docker, PostgreSQL, MySQL, Puppet, Ansible, Jenkins, Graphite, Grafana, Logstash, and Elasticsearch.
- Our computing environment strives to embrace modern principles and tooling, including continuous integration and deployment, automation frameworks, configuration management, and virtualization and container technologies. Our technical stack is diverse, offering exposure to a wide array of professional opportunities. It matters less what specific skills or technologies you already know, and more that you have proven your ability to learn on the job, master new skills, and work with the team. That said, the technologies we embrace may provide a window into our operating environment and what you might work with if you join our team:
- High performance or high throughput computing with IBM Platform and Spectrum products.
- Computing runtimes in Docker containers.
- Container orchestration frameworks, soon to include Kubernetes.
- Virtualization environments including VMware and OpenStack.
- Relational database services based on PostgreSQL and MySQL.
We intend to develop a hybrid cloud environment that links our environments with the three major cloud providers (AWS, Google Cloud, and Azure), so experience with those platforms is welcome. You will need to be comfortable interacting with code written in several languages, and you will need to be able to write code in a high-level language such as Ruby, Python, or Perl.