Job Summary: Rackspace is seeking a highly skilled and motivated HPC System Engineer to join our team. You’ll work directly for one of our flagship clients, designing, implementing, maintaining, and optimizing their high-performance computing (HPC) infrastructure. You will work closely with researchers, scientists, and other engineers to ensure the efficient and reliable operation of the HPC systems.
Work Location: 100% Remote. Because this role supports a customer in the Seattle area, we prefer to hire in either the PST or MST time zone.
Travel: There may be minimal travel to either San Antonio, TX or Seattle, WA.
Responsibilities:
- Install, configure, and maintain HPC clusters, including hardware and software components.
- Monitor system performance, identify bottlenecks, and implement solutions to optimize performance.
- Manage user accounts, permissions, and resource allocation.
- Perform regular system maintenance, updates, and patching.
- Troubleshoot and resolve hardware and software issues in a timely manner.
- Participate in the design and planning of HPC infrastructure upgrades and expansions.
- Evaluate and recommend hardware and software solutions to meet evolving computational needs.
- Implement and manage storage systems, networking infrastructure, and interconnects (e.g., InfiniBand).
- Optimize system configurations and application performance for HPC workloads.
- Profile and analyze application performance to identify areas for improvement.
- Implement and utilize performance monitoring tools and techniques.
- Provide technical support and training to HPC users.
- Collaborate with researchers and scientists to understand their computational requirements.
- Work closely with HPC architects and engineers to ensure that research needs are met.
- Document system configurations, procedures, and best practices.
- Assist HPC engineers and architects with day-to-day operations and ticket management.
- Implement and maintain security measures to protect HPC infrastructure and data.
- Ensure compliance with relevant security policies and regulations.
- Manage data backups and disaster recovery procedures.
Qualifications:
- Bachelor’s degree in computer science, engineering, or a related field. Experience may substitute for the degree.
- Minimum of 10 years of experience working with systems; 5 years specifically with HPC.
- Strong knowledge of Linux operating systems (e.g., Rocky, Ubuntu).
- Experience with cluster management tools (e.g., Slurm, PBS).
- Familiarity with high-speed interconnects (e.g., InfiniBand, Ethernet).
- Knowledge of parallel file systems (e.g., Lustre, Ceph, GPFS).
- Proficiency in scripting languages (e.g., R, Python, Bash).
- Understanding of HPC hardware architectures and technologies (e.g., CPUs, GPUs, memory).
- Strong demonstrated experience with a major configuration management tool (e.g., Terraform, Ansible), including application packaging and installation.
- Must have strong knowledge of Linux security and Linux shell scripting.
- Strong communication and interpersonal skills.
- Knowledge of data transfer protocols and large-scale storage solutions.