Software Engineering Technical Lead

WEKA

WEKA provides a software platform that powers high-performance data infrastructure, enabling organizations to accelerate innovation with modern data architecture.

Tel Aviv, Israel. Posted 2mo ago. Tags: datacenter, storage, software

At WEKA, we are building NeuralMesh™, the world's first intelligent, adaptive mesh storage system, purpose-built for the age of AI. To ensure our platform remains unbreakable in the world's largest AI and GPU clusters, we don't simply test our code. We build an adversarial distributed system as complex and sophisticated as the product itself.

The Quality Testing & Reliability group is not a traditional QA team. We are a high-octane engineering force that treats reliability as a first-class software problem. We build the systems, frameworks, and infrastructure that prove our platform's correctness at scale — and we move with the urgency and ambition of a category-defining company.

We are looking for a Technical Lead to drive the architectural direction and engineering excellence of this group. This is a senior, deeply hands-on role for a technology leader who can own the technical roadmap, mentor a team of elite engineers, and build the infrastructure that challenges WEKA's platform to its theoretical limits.

What You'll Lead

Define and own the technical architecture of the group's distributed testing and reliability platform - designing for massive scale, real-world workload simulation, and adversarial failure injection
Lead efforts involving multiple engineers: setting technical standards, running architecture reviews, driving design decisions, and mentoring engineers as they grow
Build the systems that orchestrate millions of concurrent IO operations, inject chaos at the infrastructure layer (latency, packet loss, hardware failures), and expose the hardest-to-find race conditions and consistency bugs
Advance AI-driven approaches to test automation: intelligent scenario generation, LLM-augmented root-cause analysis, and autonomous validation pipelines
Drive observability and reliability engineering across the group - building telemetry pipelines that track P99 latency, jitter, and system health, turning quality into a quantitative discipline
Collaborate deeply with Core R&D, Storage Kernel, and Infrastructure teams - translating architectural knowledge into targeted reliability strategies
Establish engineering practices - design docs, production-grade code reviews, testing philosophy, and cross-team technical alignment

What You Bring

Strong software engineering background - Python expertise is essential; ability to read, debug, and reason about C++, Rust, or Go is a significant advantage
Deep understanding of distributed systems: concurrency, consistency models, fault tolerance, and large-scale system behavior under stress
Background in one or more of: storage systems, networking (TCP/IP, RDMA), cloud infrastructure, database internals, or high-performance backend systems
Experience building large-scale infrastructure platforms, internal developer platforms, or reliability engineering systems

Leadership

Proven track record leading complex technical initiatives from architecture through delivery
Experience mentoring and growing engineers - raising the technical bar of a team, not just directing work
Ability to drive technical alignment across teams, communicate tradeoffs clearly, and make high-quality architectural decisions at speed
Comfortable operating at both the strategic and hands-on level - you write code, review designs, and shape roadmaps
Previous experience in people management roles is an advantage

Mindset

You approach quality through the lens of Site Reliability Engineering: you care about MTTD, observability, and building self-healing systems
You have a "hacker" instinct - you don't just find bugs; you find the architectural flaws that allowed them to exist
You are an early adopter of AI tools and excited about applying LLMs and generative AI to accelerate engineering velocity

Big Advantages

Experience with storage systems, file systems, or high-performance distributed environments
Background in chaos engineering, fault injection, or simulation systems
Familiarity with observability tooling and performance engineering at scale
Experience building testing or reliability platforms as first-class engineering products
Prior experience as a Team Lead in a high-growth infrastructure company

Why This Role Is Different

Most engineering leadership roles manage delivery.

This role builds the system that proves the product.

You will lead one of the most technically demanding groups in the company - solving hard problems in distributed systems correctness, adversarial infrastructure design, and AI-augmented validation. You will have real influence on how one of the industry's most advanced storage platforms is hardened, scaled, and trusted by the world's leading AI organizations.

If you want to lead engineers who are building the future of infrastructure reliability, this role was built for you.

Posted by WEKA on their own careers page: you apply directly, no recruiter in between.