Supercomputing Engineer (Test)
Company: Etched
Location: San Jose
Posted on: April 3, 2026
Job Description:
About Etched

Etched is building the world’s first AI inference system
purpose-built for transformers, delivering over 10x higher
performance and dramatically lower cost and latency than a B200.
With Etched ASICs, you can build products that would be impossible
with GPUs, like real-time video generation models and extremely
deep and parallel chain-of-thought reasoning agents. Backed by
hundreds of millions from top-tier investors and staffed by leading
engineers, Etched is redefining the infrastructure layer for the
fastest growing industry in history.

Job Summary

We are seeking a highly motivated and detail-oriented Supercomputing
Engineer (Test) to join our team. This team plays a critical role in
ensuring the reliability and stability of our highest-performance
inference server hardware and software. As a Software Engineer on
this team, you will design, develop, and execute comprehensive
burn-in test suites, analyze test results, and collaborate with
hardware and software engineering teams at Etched and our ODM
partners to identify and resolve potential issues. You will be at
the forefront of ensuring our server products meet the highest
quality standards
before they reach our customers.

Key responsibilities

- Design, develop, and implement automated burn-in test suites using
  common scripting languages (Python, Go, Bash) and test frameworks
  across all aspects of system operation, including boot sequences,
  root-of-trust, system management, workload deployment, and
  performance.
- Execute burn-in tests on server hardware, monitor system
  performance and health, and analyze test results.
- Investigate and debug hardware and software failures identified
  during testing, providing detailed reports and mitigation plans.
- Collaborate with internal and external hardware and software
  engineering teams to identify root causes of failures and
  implement corrective actions.
- Contribute to the development and maintenance of the burn-in
  testing infrastructure, including portable test environments and
  automation tools runnable in any environment.
- Create and maintain comprehensive documentation for test plans,
  test cases, and test results.
- Analyze system performance metrics to identify potential
  bottlenecks and areas for optimization.
- Participate in continuous improvement efforts to enhance the
  efficiency and effectiveness of the burn-in testing process.

Representative projects

- Develop automated test suites to stress-test CPUs, memory,
  storage, and network subsystems under extreme workloads.
- Design and implement fault injection tests to simulate hardware
  and software failures.
- Create tools to monitor and analyze system performance metrics,
  such as CPU utilization, cross-socket memory performance and
  usage, and network latency.
- Build and maintain a scalable burn-in testing environment capable
  of handling multiple server configurations.
- Collaborate with hardware engineers to develop tests for new
  server features and components.
- Contribute to the creation of dashboards that show the current
  state of burn-in testing across the server farm.

You may be a good fit if you have

- Proficiency in at least one scripting language (e.g., Python,
  Bash, Go).
- Experience with software testing methodologies and tools.
- Strong understanding of operating systems (Linux preferred) and
  server hardware architectures.
- Ability to analyze complex technical problems and provide
  effective solutions.
- Excellent communication and collaboration skills.
- Ability to work independently and as part of a team.
- Experience with version control systems (e.g., Git).
- Experience with reading and interpreting hardware logs.

Strong candidates may also have experience with (nice-to-have
qualifications)

- Experience with hardware burn-in testing or reliability testing.
- Knowledge of server virtualization and cloud computing concepts.
- Experience with performance testing and benchmarking tools.
- Familiarity with hardware diagnostic tools and techniques.
- Experience with containerization technologies (e.g., Docker,
  Kubernetes).
- Experience with CI/CD pipelines.
- Knowledge of low-level hardware communication protocols (I2C,
  etc.).
- Experience with data analysis tools and techniques.
- Candidates with experience in server hardware or software
  development, testing, or support.
- A strong interest in hardware and software reliability.
- A background in system administration or performance engineering.
- Working in a fast-paced and challenging environment.
- Working in a datacenter environment.
- Telecommunications or high-performance computing fields.

Benefits

- Medical, dental, and vision packages with generous premium
  coverage
- $500 per month credit for waiving medical benefits
- Housing subsidy of $2k per month for those living within walking
  distance of the office
- Relocation support for those moving to San Jose (Santana Row)
- Various wellness benefits covering fitness, mental health, and
  more
- Daily lunch and dinner in our office

How we’re different

Etched believes in the Bitter Lesson. We think most of the progress
in the AI field has come from using more FLOPs to train and run
models, and the best way to get more FLOPs is to build
model-specific hardware. Larger and larger training runs encourage
companies to consolidate around fewer model architectures, which
creates a market for single-model ASICs.

We are a fully in-person team in San Jose (Santana Row), and we
greatly value engineering skills. We do not have boundaries between
engineering and research, and we expect all of our technical staff
to contribute to both as needed.
Keywords: Etched, Mountain View, Supercomputing Engineer (Test), Engineering, San Jose, California