Senior Software Engineer II - Applied AI and Evaluations (Remote Eligible)

-REMOTE, USA-

For over 20 years, Smartsheet has empowered teams to manage work seamlessly and scale solutions smarter. Now, in our most ambitious chapter yet, we are uniting human teams with AI agents. By orchestrating the work agents do best, automating manual tasks and uncovering insights at scale, we create the space for people to focus on what truly matters: judgment, creativity, and big thinking. That is magic at work, and it's what we show up for every day. Smartsheet is building the next generation of AI-powered work management through SmartAssist, our intelligent agent platform. As we scale from early demos to production-grade agents, quality is the critical frontier, and we are looking for an Agent Quality Engineer to own it.

This is not a QA role. It's a deeply technical, high-autonomy position at the intersection of LLM evaluation, prompt and context engineering, and retrieval-augmented generation. You will diagnose why our agents fail, design the systems that catch regressions, and drive measurable improvements across our orchestrator and subagent fleet.

You will work closely with our Agent Engineering and AI Platform teams, embedded in a team that has already shipped evaluation infrastructure on Databricks/MLflow and is building toward a mature Agent Development Lifecycle (ADLC).

You Will:

Own agent quality end-to-end: diagnosis, improvement, and validation across SmartAssist's orchestrator and subagents

Identify failure modes across quality dimensions (factual accuracy, completeness, tone, actionability, and latency) and prioritize what to fix

Drive quality improvements through prompt engineering, context engineering, and RAG retrieval tuning

Extend and mature our evaluation framework: scorers, golden datasets, regression gates, and online evaluation for production traffic

Close the feedback loop: ensure that every change has a measurable, attributable quality signal

Collaborate with our Agent Architecture lead to distinguish quality problems that require prompt/context solutions from those that require structural fixes

Establish repeatable methodology that scales beyond any single agent or subagent

You Have:

Required

8+ years of software engineering experience, with at least 2 years working directly with LLMs in production

Deep, hands-on experience with prompt engineering and context engineering; you understand how model behavior changes with framing, structure, and input design

Strong working knowledge of RAG architectures: chunking strategies, embedding models, retrieval evaluation, and failure diagnosis

Experience building or extending LLM evaluation frameworks; you have designed scorers, worked with golden datasets, and thought carefully about what good looks like

Fluency in agent system design; you don't need to own the architecture, but you can engage as a peer on architectural tradeoffs that affect quality

Strong Python skills; comfortable working in data-heavy environments (Databricks, Delta tables, or equivalent)

Ability to communicate complex quality findings (written and verbal) to both technical and non-technical stakeholders; you can explain what's broken, why it matters, and what needs to happen next without losing the room

Strong cross-functional judgment; you know when to escalate, when to resolve independently, and how to build credibility across engineering, product, and AI platform teams

A bias for clarity in ambiguous situations; when failure modes are murky and tradeoffs are real, you bring structure and a clear point of view rather than waiting for consensus

Legally eligible to work in the U.S. on an ongoing basis

BS or MS in Computer Science, a related field, or equivalent industry experience

Strong Plus

Experience with MLflow or similar experiment tracking platforms

Familiarity with CI-integrated evaluation pipelines

Experience with multi-agent orchestration frameworks

Prior work in an Applied AI or LLMOps function within a product company

What Success Looks Like:

In your first 6 months, you will have:

Delivered measurable, validated quality improvement on at least one SmartAssist agent across our defined quality dimensions

Expanded evaluation coverage to close the most significant blind spots in our current framework

Established a repeatable quality improvement methodology the broader team can apply

Why This Role:

You will have real ownership, not a supporting role on someone else's roadmap

SmartAssist is shipping to real users; this work has immediate, visible impact

You will be part of a fast-moving team that deeply values engineering rigor and actively seeks diverse perspectives

Current US Perks & Benefits:

Employer-subsidized medical, vision, and dental coverage for full-time employees

401k Match to help you save for your future (50% of your contribution up to the first 6% of your eligible pay)

Monthly stipend to support your work and productivity

Flexible Time Away Program, plus Sick Time Off

US employees are automatically covered under Smartsheet-sponsored life insurance and short-term and long-term disability plans

US employees receive 12 paid holidays per year

Up to 24 weeks of Parental Leave

Personal paid Volunteer Day to support our community

Opportunities for professional growth and development including access to Udemy online courses

Company Funded Perks, including a counseling membership, local retail discounts, and your own personal Smartsheet account

Teleworking options from any registered location in the U.S. (role specific)

Smartsheet provides a competitive base salary range for roles that may be hired in the different geographic areas where we are licensed to operate. Actual compensation is determined by several factors including, but not limited to, level of professional and educational experience, skills, and specific candidate location. In addition, this role will be eligible for a market-competitive incentive opportunity.

US Base Salary Pay Range: $175,000 to $245,000 USD

Get to Know Us:

At Smartsheet, your ideas are heard, your potential is supported, and your contributions have real impact. You'll have the freedom to explore, push boundaries, and grow beyond your role. We welcome diverse perspectives and nontraditional paths—because we know that impact comes from individuals who care deeply and challenge thoughtfully. When you're doing work that stretches you, excites you, and connects you to something bigger, that's magic at work. Let's build what's next, together.

Equal Opportunity Employer:

Smartsheet is an Equal Opportunity (EEO) employer committed to fostering an inclusive environment with the best employees. It is our policy to provide equal employment opportunities to all qualified applicants in accordance with applicable laws in the US, UK, Australia, Germany, Costa Rica, Japan, Bulgaria, and India. All qualified applicants will receive consideration without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran or disabled status, or genetic information.

If there are preparations we can make to help ensure you have a comfortable and positive interview experience, please let us know.

#LI-Remote
