AI PM Interview Guide

AI/ML Literacy for the Interview Room

Interview Room

AI/ML Concept Pipeline

This chapter is not a textbook. I am not going to explain gradient descent, walk you through backpropagation, or give you a tour of transformer architecture. You do not need any of that to be an effective AI PM or to succeed in an AI PM interview. What you need is a precise, practical understanding of the concepts that come up in AI PM work every day — what they mean, what they cost, where they fail, and how a product manager’s decisions interact with them.

One framing before we begin. Traditional software behaves deterministically. Given the same input, it produces the same output, every time. AI systems do not. That single difference changes product design, QA philosophy, analytics, user trust, support models, rollout strategy, and roadmap planning. Everything downstream of that sentence is a consequence of it. Keep it in your head as you read this chapter.

Tokens and Context Windows

A token is roughly a word fragment — not a word, not a character, but something in between. “Unbelievable” might be three tokens. “The” is one. This matters because AI language models do not process text the way humans read it. They process sequences of tokens, and they have a fixed limit on how many tokens they can process at once. That limit is the context window.

Context windows have grown dramatically in recent years — from 4,000 tokens in early GPT-3 to 200,000 tokens or more in current frontier models. This sounds like a solved problem. It is not. How this shows up in an interview: when you are designing a feature that passes documents, conversation history, or retrieved content to a model, the question is not just “does it fit?” The question is how longer context affects cost, latency, and output quality. Longer context means more tokens processed, which means higher inference cost and higher latency. More insidiously, it can mean worse output quality — a phenomenon I will return to in the RAG section.

Inference

Inference is the process of running a trained model to generate an output. When a user sends a message to a chat product and the model responds, that response generation is inference. Inference is where the cost lives in production AI systems. Training happens once (or periodically). Inference happens on every user action, at scale, in real time.

How this shows up in an interview: any question about cost modeling for an AI feature is fundamentally a question about inference cost. When you are asked “how would you think about the economics of this feature,” you should be asking: how many inferences per user per day, how long are the inputs and outputs in tokens, what does the model provider charge per token, and what is the margin on the feature? You do not need exact numbers.

You need to demonstrate that you understand the cost driver.

Fine-Tuning vs. RAG vs. Prompting These are three different strategies for making a model behave the way you need it to for a specific product use case. Understanding when to use each is a real PM competency.

Prompting means giving the model instructions in the input — system prompts, few-shot examples, structured guidance about how to respond. It is the fastest, cheapest, and most flexible approach. Start here. How this shows up in an interview: if you immediately jump to fine-tuning as a solution without explaining why prompting was insufficient, you have revealed that you do not understand the cost gradient. Prompting should be your default until you have evidence it is not working.

Retrieval-augmented generation (RAG) means augmenting the model’s input with

information retrieved from an external store — typically a vector database containing chunked, embedded documents. The model is not retrained. It just gets relevant context at inference time. RAG is the right approach when the information the model needs is too large to fit in a prompt, changes frequently, or must come from a specific trusted source.

How this shows up in an interview: RAG questions are extremely common. You will be asked to design a RAG system, debug a RAG system, or evaluate a RAG system.

Understanding the retrieval layer — not just the generation layer — is the competency being tested.

Fine-tuning means taking a base model and continuing to train it on task-specific data to change its behavior more fundamentally. It is more expensive, slower to iterate, and harder to debug than prompting or RAG. It is appropriate when you need consistent stylistic behavior that prompts cannot reliably produce, or when you have a narrow, well-defined task with substantial labeled training data. How this shows up in an interview: the question is almost always “why fine-tune rather than RAG or prompt?” If you can articulate that answer specifically, you pass.

The RAG Context Lesson

I assumed for a long time that more context meant better output. More retrieved documents, longer retrieval windows, more information available to the model — it seemed like a straightforward quality improvement. Production disagreed.

In a production RAG deployment I worked on, increasing context window utilization did not improve answer quality. It degraded it. What happened: retrieval was surfacing a larger number of documents, some of which were relevant and some of which were marginally relevant or contradictory. The model, facing a larger and noisier input, began producing answers that reflected the noise. Conflicting information in the retrieved set led to hedged or internally inconsistent outputs. Ranking instability in the retrieval layer meant that what surfaced on one query was not consistent with a nearly identical query. And latency climbed in ways that were not visible in early testing but became a real user experience problem at scale.

The actual problem was not context size. It was retrieval precision. What we needed was not more documents — it was more relevant documents, fewer of them, ranked better. That realization fundamentally changed how I think about RAG architecture. The retrieval layer is not a commodity. It is where the quality lives. How this shows up in an interview: if you are asked to improve a RAG system and your only answer is “add more context,” you have demonstrated that you have not operated one in production.

Agents and Orchestration

An AI agent is a system where a model can take sequences of actions — not just generate a single response, but decide what to do next, call tools, retrieve information, and iterate toward a goal. Orchestration refers to the framework or logic that manages that sequence.

How this shows up in an interview: agent questions are increasingly common, especially at OpenAI and Anthropic. The PM-relevant questions are: what actions can the agent take, which of those actions are reversible, what oversight is in place when the model makes a mistake, and how does the system fail gracefully? These are product design questions, not engineering questions. Strong candidates frame agent design as a trust and safety problem, not just a capability problem.

Embeddings and Retrieval

An embedding is a numerical representation of a piece of text (or image, audio, etc.) that captures its semantic meaning. Similar content has similar embeddings. This is the foundation of retrieval in RAG systems — you convert documents and queries to embeddings and find documents whose embeddings are close to the query embedding. How this shows up in an interview: the relevant PM question is about retrieval quality.

Embedding-based retrieval is probabilistic — “close” in embedding space does not always mean “relevant” in the way a user intends. Understanding this is important for designing eval frameworks for retrieval systems.

Latency Types

There are two latency metrics that matter most for AI product design. Time to first token (TTFT) is how long the user waits before they see any output begin to appear. Tokens per second (TPS) is how fast the output streams after it starts. For interactive features, TTFT is often the more important user experience metric — users will tolerate slower generation if something starts appearing quickly. For batch processing or non-interactive features, total latency matters more. How this shows up in an interview: when you are designing a streaming chat feature versus a document analysis feature, the latency frame is different.

Showing that distinction demonstrates operational fluency.

Hallucination and What Causes It

Hallucination is when a model generates content that is plausible-sounding but factually incorrect or fabricated. The word “hallucination” has become overloaded — people use it to mean everything from factual errors to confabulation to confident wrongness. For PM purposes, the important thing to understand is that hallucination is not a bug in the sense of

a discrete defect that can be patched. It is a property of how these models generate text — by predicting the most probable next token, they can produce fluent, confident output that is simply wrong. How this shows up in an interview: any feature that surfaces factual claims to users — answers, summaries, recommendations, generated reports — requires you to have a position on hallucination risk and mitigation. That position involves eval design, confidence scoring, user interface choices (when to show uncertainty), and fallback behavior.

Temperature and Sampling

Temperature is a parameter that controls how “creative” or deterministic the model’s output is. Low temperature means the model consistently picks the highest-probability tokens, producing more predictable, conservative output. High temperature introduces more randomness, producing more varied, sometimes more creative, sometimes less coherent output. How this shows up in an interview: the relevant question is always product-specific. For a customer support feature where consistency and accuracy matter, low temperature is appropriate. For a creative writing assistant, higher temperature may be desirable. Knowing this and being able to reason about the tradeoff — not just recite the definition — is the signal.

You do not need to write code in an AI PM interview. But you do need to understand the consequences. The technical concepts in this chapter are not academic. They are the vocabulary of the tradeoff conversations that happen every day in AI product work, and they are what interviewers are listening for when they ask you to walk through a system or debug a failure.