One of the most disappointing
realizations for beginner AI engineers
is that the projects they've done while
learning won't actually help them land a
job. These standard chatbot tutorials
and RAG follow-alongs simply aren't
complex enough or close enough to what
we do in industry to show employers that
you have the skills they need. The
problem is as a beginner, you've never
seen a production AI system. You don't
know what you need to build. So you do
the next best thing and follow example
projects you find online. And now you
have the exact same ChatGPT wrapper as
everyone else. Instead, what you need is
a framework for how to build real
end-to-end AI applications that are unique
and demonstrate actual engineering
skills. In my years of sitting on hiring
panels at companies like Amazon and in
my career coaching work, these kinds of
robust, well-engineered projects are the
ones that actually make a difference in
your career. Today, we're breaking down
that framework. I'm not sharing example
projects to follow, but instead a
step-by-step guide on everything you
need to think about and include to build
something that is legit. Let's get
started. Here are the eight core
components we're going to discuss.
Problem framing and success metrics.
What are we actually solving and how do
we know if we've solved it? Prompt
engineering and systematic tracking. How
to version, evaluate, and improve
prompts like actual engineering work.
Model selection and evaluation. A
systematic way to compare different
models and make informed decisions. RAG:
building a retrieval system with smart
chunking, ranking, and evaluation. Agent
systems with a solid understanding of
design, security, and error handling.
System monitoring and error analysis.
Tracking performance at every level and
actually learning from failures.
Deployment and user interface, making
your AI system accessible and reliable.
And as a bonus, fine-tuning. This is
less common in industry, but it's still
a super valuable skill to demonstrate.
At the end, I'll share some learning
resources that can help you dig into
details on everything we talked about
today. All right, let's get into it.
Starting with the thing most people
completely skip, problem framing and
success metrics. I know that sounds
really basic. Obviously, you need to
know what problem you're solving, right?
But you'd be surprised at how many
projects completely skip this step. It's
easy to get excited about a cool new
tool and try to find some way to use it,
which is fine as long as you can solve
something important along the way. But
really, this is the opposite of how we
approach projects in industry. In
reality, we start with a business
problem and then we figure out if an LLM
is even the right tool to solve it. And
if it is, we define very clearly what
solved means. Once you've decided an LLM
is the right approach, you need to think
about the constraints that are specific
to AI engineering. This might include
things like latency and cost constraints
as well as a certain quality bar you
need to reach. By this I mean what does
good enough actually mean for your use
case. If you're building a medical Q&A
system, you need extremely high accuracy
and you probably need citations. If
you're building a creative writing
assistant, users might be fine with
outputs that are 80% useful and they'll
edit the rest. Measuring quality can be
tricky, though. Defining success metrics
for AI systems is different from
traditional ML because you're often
dealing with more subjective outputs.
You need metrics at multiple levels.
User-facing metrics answer, "Is this
actually useful?" These metrics might
include task completion rate or user
satisfaction ratings. Technical metrics
answer, "Is the AI performing well?" For
these, you might measure response
quality using an LLM as judge, human
ratings, or accuracy on test cases. And
system metrics answer, "Is this sustainable?" These include latency (how fast are the responses?), cost per request (can we afford to run this at scale?), error rate (how often does the system fail?), and uptime (is the service actually available to users?). The key is
that these metrics need to be measurable
and you need to track them over time.
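To make that concrete, here's a minimal sketch of how you might pin those targets down in code so your evaluation scripts can check against them. The metric names and thresholds here are placeholder assumptions, not recommendations for any particular project.

```python
# Hypothetical success criteria for a project -- the names and numbers are just examples.
SUCCESS_METRICS = {
    "user_facing": {"task_completion_rate": 0.80, "avg_satisfaction": 4.0},   # satisfaction out of 5
    "technical":   {"llm_judge_score": 0.75, "test_case_accuracy": 0.85},
    "system":      {"p95_latency_s": 3.0, "cost_per_request_usd": 0.02,
                    "error_rate": 0.01, "uptime": 0.995},
}

def check_targets(measured: dict, targets: dict = SUCCESS_METRICS) -> list[str]:
    """Return the metrics that miss their targets (lower-is-better for latency, cost, error rate)."""
    lower_is_better = {"p95_latency_s", "cost_per_request_usd", "error_rate"}
    misses = []
    for group, metrics in targets.items():
        for name, target in metrics.items():
            value = measured.get(name)
            if value is None:
                continue
            ok = value <= target if name in lower_is_better else value >= target
            if not ok:
                misses.append(f"{group}/{name}: {value} vs target {target}")
    return misses
```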
As you work on scoping your project, make
sure you write down the decisions you
made and why. This can be as simple as
adding it to a readme, but you'll
definitely want to document it so you're
prepared in interviews. All right, now
that we know what problem we're solving
and how we're going to measure if we've
solved it, let's talk about a component
that separates amateurs from
professionals. Prompt engineering and
systematic tracking. One thing that's
not clear from most explanations of
prompt engineering is that it's less of an art and a lot more of a systematic process. Here are some
simple things you can do to improve the
quality a ton and have your project
stand out. First, treat your prompts
like separate components of your system.
Don't hard-code them directly in your application files; instead, keep a separate file for each version of your prompt. Something like this:
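For example, a simple layout might look like the sketch below. The directory name and the loader helper are just one possible convention, not a standard.

```python
# prompts/
#   summarizer_v1.txt
#   summarizer_v2.txt
#   summarizer_v3_few_shot.txt
from pathlib import Path

PROMPT_DIR = Path("prompts")

def load_prompt(name: str, version: str) -> str:
    """Load a specific prompt version, e.g. load_prompt('summarizer', 'v2')."""
    return (PROMPT_DIR / f"{name}_{version}.txt").read_text()

# Keep the active version in config, not scattered through your code,
# so you can swap versions and compare them easily.
ACTIVE_PROMPT_VERSION = "v2"
system_prompt = load_prompt("summarizer", ACTIVE_PROMPT_VERSION)
```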
Next, build a structured evaluation framework. Create a set of
test inputs with the expected outputs or
at least criteria for what a good output
looks like. A good test set should cover
different types of questions, difficulty
levels, and edge cases, and it should be
updated regularly based on new failure
patterns you discover as you work.
Ideally, you'd want 100 to 300 examples, but
start with what you can get. Then, when
you change your prompt, run it against
the test set to measure performance. You
can use metrics like BLEU and ROUGE, exact matching for classification tasks, or LLM-as-judge, which is where you use a strong model like GPT-5 to evaluate the quality using a consistent rubric.
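Here's a rough sketch of what that evaluation loop might look like. The test-set format and the `run_prompt` helper are assumptions; `run_prompt` stands in for whatever function actually calls your model.

```python
import json
from pathlib import Path

def run_prompt(prompt_version: str, test_input: str) -> str:
    """Placeholder: call your model with the chosen prompt version and return its output."""
    raise NotImplementedError

def evaluate_prompt(prompt_version: str, test_set_path: str = "tests/prompt_cases.json") -> float:
    """Run one prompt version over the whole test set and return the pass rate."""
    cases = json.loads(Path(test_set_path).read_text())  # [{"input": ..., "expected": ...}, ...]
    passed = 0
    for case in cases:
        output = run_prompt(prompt_version, case["input"])
        # Exact match works for classification; swap in an LLM-as-judge call for open-ended outputs.
        if output.strip().lower() == case["expected"].strip().lower():
            passed += 1
    return passed / len(cases)

# Compare versions on the same test set before promoting a new prompt:
# for v in ["v1", "v2", "v3_few_shot"]:
#     print(v, evaluate_prompt(v))
```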
For a simple project, you can do your prompt
evaluation more manually. But there are
also tools like PromptLayer, Langfuse, and Weights & Biases that can
help you with prompt versioning and
evaluation tracking. All right, so now
you have a way to do systematic prompt
engineering. But which model should you
actually use for your project? That's
what we'll need to figure out next.
Model selection and evaluation. A common
beginner mistake I see is to just
default to using the newest model for
everything because it's clearly the best
model, right? But is it actually best
for your specific use case? Is it worth
the cost? Is it fast enough? Would a
smaller, cheaper model work just as
well? You don't know unless you actually
test it. In industry, we're constantly
evaluating different models because cost
and performance matter. A model that's
twice as expensive but only 5% better
might not be worth it. I have an entire
video on how to do model selection in
detail, but at a minimum, you'll want to
run a few models from different
providers and different sizes on your
test set and then compare both
performance and cost using some
consistent rubric and metrics. As usual,
make sure to document your process in
your readme. It might be that you find
one model is the right balance between
cost and performance. Or maybe you need
a combination. Model routing is a more
advanced method that works well in
practice. This is when you use a cheap,
fast model to classify the difficulty of
the incoming query, then route it to the
appropriate model from there. Simple
queries can go to a cheap and fast model
and complex queries can go to a more
expensive, higher-quality one.
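As a sketch of the idea, assuming a placeholder `call_model` helper and made-up model names:

```python
def call_model(model: str, prompt: str) -> str:
    """Placeholder: send the prompt to the named model via your provider's SDK."""
    raise NotImplementedError

def route_query(query: str) -> str:
    """Classify query difficulty with a cheap model, then route to an appropriate model."""
    label = call_model(
        "cheap-fast-model",  # placeholder model name
        f"Classify the difficulty of this query as SIMPLE or COMPLEX. Reply with one word.\n\n{query}",
    ).strip().upper()
    target = "large-expensive-model" if label == "COMPLEX" else "cheap-fast-model"
    return call_model(target, query)
```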
All right, so now you know how to systematically
choose models. Next up is one of the
most important components of modern AI
engineering systems: RAG. But before we
get into Rag, I want to talk about
something that's essential for building
real AI systems, especially when we get
into agents in a few minutes. Access to
live data from the web. Here's the
thing. If you're building an agent or
any application that needs current
information, you need a way to access
real-time data from the web. And web
scraping is a pain in the butt. You have
to deal with CAPTCHAs, rate limits,
changing HTML layouts, proxies, lots of
annoying stuff that's a distraction from
your project work. SerpAPI handles all
of that for you. They provide clean,
structured search results from Google,
YouTube, Bing, and more through a simple
API. You make one API call and get back
a clean JSON object with exactly the
data you need. This is super helpful for
AI engineering projects. Want to build
an agent that can search for current
information? Use SerpAPI's Google Search API. Training a model and need a dataset of classified images? Their Google Images API gives you pre-classified image titles, URLs, and thumbnails. Building a research assistant? Their Google Scholar API
gives you access to peer-reviewed
articles with full metadata. We're
actually going to talk about agents in a
few minutes, and many production agent
systems use SerpAPI as one of their core
tools. You can get started with 250 free
credits by clicking the link in the
description or scanning the QR code on
screen. Thank you so much to SerpAPI for
sponsoring this video. All right, now
let's talk about RAG. RAG stands for
retrieval augmented generation, which is
when we connect an LLM to a data source
like our company database. The choices
you make about chunking, embeddings, and
retrieval strategies have a massive
impact on your system's performance. And
showing that you actually thought about
these decisions is what separates your
project from one of the dozens of RAG
tutorials out there. Let me break down
what you need to focus on. Chunking is
when we break up documents into chunks.
This is one of the most important
decisions in your RAG pipeline. There's
no universal right answer on how to do
this. You can try things like fixed size
chunking at different sizes, semantic
chunking where we split by meaning, or
many other options. Test different
chunking strategies on your specific use
case. For each strategy, measure
retrieval accuracy. Again, we're going
to create a test set this time of 20 to
30 questions where you know which
sections of your documents contain the
right answer. Then you measure what
percentage of the time your retrieval
system actually returns those sections.
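Here's a minimal sketch of that kind of experiment, assuming a placeholder `retrieve` function that returns chunk IDs for a query:

```python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking by characters; try several sizes and compare."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def retrieval_hit_rate(questions: list[dict], retrieve) -> float:
    """questions: [{"question": ..., "relevant_chunk_ids": [...]}]; retrieve(query, k) -> list of chunk IDs."""
    hits = 0
    for q in questions:
        retrieved = set(retrieve(q["question"], k=5))
        if retrieved & set(q["relevant_chunk_ids"]):
            hits += 1
    return hits / len(questions)
```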
The next decision is what embedding
model to use and what vector database to
store the embeddings in. Embeddings are
numerical representations of your
documents. To transform text into
embeddings, we use a pre-trained
embedding model most of the time. OpenAI's text embedding models are solid
and really easy to use. Open source
alternatives like sentence transformers
let you run models locally for free.
Again, try different ones and test them
out. Once you've made these embedding
vectors, you need to store them
somewhere. For vector storage, you can
start working with cloud platforms right
away, which I definitely recommend, and try out AWS OpenSearch or something
similar on GCP or Azure. Once you've
created your embeddings and stored them
in a vector store, you need some way for
the system to find the relevant
embeddings. This is the search step. The
simplest approach is semantic similarity
search. You convert the user's question
into an embedding as well. Then you find
the chunks with the most similar
embeddings. That's your baseline.
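A minimal baseline might look something like this sketch, using the open-source sentence-transformers library; the specific model name is just an example:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # open-source embedding models, run locally

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model; swap in whatever you're evaluating

chunks = ["...your document chunks..."]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def search(query: str, k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query embedding."""
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec            # cosine similarity, since the vectors are normalized
    top_idx = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_idx]
```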
Hybrid search combines keyword search with
semantic search. Maybe a chunk isn't
semantically similar to the query
embedding, but contains the exact
technical term the user mentioned. In
that case, hybrid search would catch
that. Most modern vector databases
support this right out of the box. Once
you have your search strategy, there are
a couple more things you can do to
improve your results. Reranking is a
two-stage approach. First, you retrieve
maybe 20 chunks using a fast vector
search. Then, you use a more accurate
but slower model to rerank those 20
chunks and pick the best five. This is
common in production systems because it
balances speed and accuracy.
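Here's a rough sketch of that two-stage pattern, again using sentence-transformers; the cross-encoder model name is just an example:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example reranking model

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Stage 2: score each (query, chunk) pair with a slower, more accurate model and keep the best."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]

# Stage 1 (fast vector search) retrieves ~20 candidates, e.g. with the search() baseline above,
# then rerank(query, candidates) picks the best five to put into the prompt.
```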
Query expansion means you don't just search
what the user's exact question was. You
might use an LLM to rephrase the
question to be a bit more specific,
generate multiple variations, or extract
key entities and concepts. For example,
if a user asks, "What's its population?"
after previously asking about Paris, the
query might be expanded to say, "What's
the population of Paris?" You don't need
to implement all of these for a
portfolio project, but testing at least
one or two and documenting the
improvement shows that you understand
retrieval is more than just basic
similarity search. Once you've put
together all these components, it's time
to evaluate your RAG pipeline. We want
to not only evaluate the quality of our
entire system, but also each component.
Here are some metrics to consider.
Retrieval accuracy. What percentage of
the time do you retrieve the correct
chunks? Use metrics like precision at K
or recall at K. Answer accuracy given
good chunks. Does the LLM produce the
correct answer when it receives the
right context? This isolates whether
your LLM is the problem or your
retrieval is the problem. End-to-end
accuracy. Does the whole system work? By
measuring separately, you can diagnose
exactly where failures happen and
hopefully fix them.
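The retrieval metrics are easy to compute yourself. A minimal sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the truly relevant chunks that appear in the top-k retrieved results."""
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / k
```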
All right, so now you have a solid RAG system. But what if you
need your AI to actually do things, not
just answer questions? That's where
agents come in. An agent is basically an
LLM that can use tools and take actions
autonomously to accomplish a goal.
Instead of just answering a question, it
might search the web, run code, query a
database, call an API, and synthesize
the results. For a portfolio project,
including an agent component shows you
can build complex autonomous systems. I
have a really comprehensive course on
agent systems that goes into all the
details. I'll add a link to that. But
for now, we'll just cover this at a high
level. For your project, you can use a
framework like LangGraph or CrewAI, or
you can build your own agent system
using function calling features built
into OpenAI's or Anthropic's APIs. The
important thing is that you understand
what's happening under the hood, not
just that you copied example code. And
you've thought about the things that
separate toy projects from production
systems. Things like error handling: LLMs make mistakes. Your agent needs to
handle this gracefully. Also security.
If your agent can execute code or call
APIs, you have security concerns.
Monitoring agent behavior. Log
everything your agent does, what tools
it called, what the outputs were, and
what decisions were made at each step,
how long each step took, and where it
succeeded or failed. This lets you debug
when things go wrong and improve your
agent over time.
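Here's a minimal sketch of an agent loop that ties those concerns together: error handling, a hard step cap, and per-step logging. The `call_llm` helper and the tool implementations are placeholders, not a real framework's API.

```python
import json, time, logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

TOOLS = {  # each tool is a plain function; a real system also validates arguments and permissions
    "web_search": lambda query: f"results for {query}",   # placeholder implementation
    "run_sql": lambda sql: "query rejected",               # e.g. guard against destructive statements
}

def call_llm(messages: list[dict]) -> dict:
    """Placeholder: ask the model for the next action, e.g. {'tool': 'web_search', 'args': {...}} or {'final': ...}."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        start = time.time()
        action = call_llm(messages)
        if "final" in action:
            log.info("finished in %d steps", step + 1)
            return action["final"]
        tool, args = action.get("tool"), action.get("args", {})
        try:
            result = TOOLS[tool](**args) if tool in TOOLS else f"unknown tool: {tool}"
        except Exception as exc:            # tools fail; feed the error back instead of crashing
            result = f"tool error: {exc}"
        log.info("step=%d tool=%s latency=%.2fs", step + 1, tool, time.time() - start)
        messages.append({"role": "assistant", "content": json.dumps(action)})
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Stopped: exceeded max steps"    # hard cap so a confused agent can't loop forever
```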
As usual, we need to test this component of our system.
Agents are hard to test because they're
non-deterministic and multi-step. At a
minimum, you need unit tests for
individual tools, integration tests for
complete workflows, and some adversarial
testing to see what happens if a user
asks for something malicious. Create a
test set with at least 10 to 15
representative tasks from simple to
complex. Measure task completion rate
and average steps to completion. All
right, so you've built all these cool
components. You have prompts, RAG, and
maybe an agent. But if users can't
actually interact with your system, it's
not really a complete project.
Deployment and UI are where you make
your AI system something that's actually
useful. The most professional approach
to deployment is building a REST API
that serves your AI system. FastAPI is the standard choice for Python. It's
fast, it has automatic documentation,
and it handles async operations well,
which is important for LLM calls that
can take several seconds. The key things
to handle are streaming responses so
your UI feels faster, error handling,
because again, LLM APIs fail sometimes,
rate limiting to prevent abuse, and
authentication with at least a simple
API key.
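A minimal FastAPI sketch with an API-key check and basic error handling might look like this; `generate_answer` is a placeholder for your actual pipeline:

```python
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
API_KEY = "change-me"  # in practice, load this from an environment variable

class Query(BaseModel):
    question: str

def generate_answer(question: str) -> str:
    """Placeholder: call your RAG pipeline / agent and return the answer."""
    raise NotImplementedError

@app.post("/ask")
async def ask(query: Query, x_api_key: str | None = Header(default=None)):
    if x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
    try:
        return {"answer": generate_answer(query.question)}
    except Exception:
        # LLM APIs fail sometimes; return a clean error instead of a stack trace
        raise HTTPException(status_code=502, detail="Upstream model error")
```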
Then you need to put it somewhere people can actually use it, which usually means hosting your FastAPI app on something like AWS or GCP so
it has a stable public URL. Once it's
live, you could create a UI with
Streamlit or Gradio for a really simple setup, or a React or Next.js app if you want
it to look a little bit more like a
product. Okay, so let's say you've
deployed your project and people are
using it. Hooray. But now you need to
know if something breaks. That's where
monitoring comes in. Monitoring is what
separates a demo from a real system. And
for a portfolio project, having even
basic monitoring is a really strong
signal. Here's what to monitor for each
component. For prompts, track response
quality scores, format compliance,
refusal rates, and average response
length. If your refusal rate suddenly
spikes from 2% to 15%, something changed
and you need to investigate that. For
RAG, track retrieval confidence scores,
number of chunks retrieved, source
diversity, and retrieval latency. If
confidence scores are dropping over
time, your knowledge base might be
getting stale, or maybe user queries are
shifting in some way. For agents, track
task completion rate, average steps to
completion, tool success rates, error
types, and cost per task at a minimum.
There's a lot with agents. If your agent
is suddenly taking 15 steps to complete
tasks that used to take five steps, for
example, that means something's wrong.
For the overall system, track end to end
task success, user satisfaction,
latency, cost per request, error rate,
and uptime. In order to monitor all of
this, you need a logging system. This is
when we track what happened with each
request. At a minimum, log the timestamp, user query, what components were
used, things like which chunks were
retrieved, which model was used, what
prompt version, etc., the response, latency, cost, and any errors that
occurred. For a simple portfolio project, write these logs to a file or a SQLite database. For something more advanced, you can use a proper logging tool or service.
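Here's a minimal sketch of a JSON-lines logger you could call from your API handler; the field names are just one reasonable set of assumptions:

```python
import json, time, uuid
from pathlib import Path

LOG_FILE = Path("logs/requests.jsonl")
LOG_FILE.parent.mkdir(parents=True, exist_ok=True)

def log_request(query: str, response: str, *, model: str, prompt_version: str,
                retrieved_chunk_ids: list[str], latency_s: float, cost_usd: float,
                error: str | None = None) -> None:
    """Append one structured record per request so you can analyze failures and trends later."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "model": model,
        "prompt_version": prompt_version,
        "retrieved_chunk_ids": retrieved_chunk_ids,
        "response": response,
        "latency_s": latency_s,
        "cost_usd": cost_usd,
        "error": error,
    }
    with LOG_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")
```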
All right, so now you've got monitoring in place and a
working system. Now, let's talk about
fine-tuning, which is more advanced, but
can really improve the performance of
your project. You might be wondering why
I put fine-tuning all the way at the end
in a bonus section. The reason is
fine-tuning is actually less common in
industry than people think. Most of the
time, a well-engineered prompt gets you
90% of the way there. Fine-tuning is
only worth it when you've already
optimized everything else and you need
that little extra performance boost or
when you have very specific requirements
that prompting can't solve. Here are
some realistic use cases for fine-tuning
in your project. Consistent output
formatting. If you need your model to
always output a very specific JSON
structure and prompting isn't reliable
enough, fine-tuning can help. Matching a
larger model's performance with a
smaller model. Maybe Sonnet 4.5 gives
you great results, but it's too
expensive or slow. You could fine-tune
Haiku 3.5 on examples from Sonnet 4.5 to
get similar quality at a lower cost.
Domain specific language. If you're
working with specialized terminology
like medical, legal, or technical
documentation, fine-tuning on domain
specific examples can improve
performance. And for RAG systems, you
can fine-tune your embedding model on
domain specific data to improve
retrieval quality. The key is to only
fine-tune if you have a clear reason and
you've already tried everything else. If
you do decide to fine-tune, here's how
to do it properly. You're going to need
to create a high-quality training data
set. This is the most important part.
Your fine-tuned model is only as good as
your training data. You need at least 100 to 500 example inputs and ideal outputs.
These should be real examples from your
use case. Then establish a baseline
before you fine-tune. Measure
performance with your best prompt on a
held out test set. This is your baseline
to beat. Train multiple versions. Train
with different amounts of training data,
different numbers of epochs, different
learning rates, and track everything.
Finally, compare performance. Does the
fine-tuned model actually beat your
baseline? By how much? Is the
improvement worth the effort and cost?
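If you go the hosted fine-tuning route, the training data is typically a JSONL file of chat examples. Here's a minimal sketch of preparing one, assuming the chat-style format that OpenAI's fine-tuning API uses:

```python
import json

def to_finetune_example(user_input: str, ideal_output: str,
                        system: str = "You are a helpful assistant.") -> dict:
    """One training example: a full conversation ending in the ideal assistant response."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": ideal_output},
    ]}

def write_training_file(pairs: list[tuple[str, str]], path: str = "train.jsonl") -> None:
    """pairs: (input, ideal_output) tuples drawn from real examples in your use case."""
    with open(path, "w") as f:
        for user_input, ideal_output in pairs:
            f.write(json.dumps(to_finetune_example(user_input, ideal_output)) + "\n")
```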
All right, so we've covered all eight
components. Now, let's talk about how
they all fit together and what your
final project should look like. Let me
show you how all the pieces we've
discussed fit together. A user submits a
query through your UI. The request hits
your API and gets logged with metadata.
If you have an agent, it decides whether
it needs RAG, which tools to use, and
what actions to take. If RAG is needed,
the query gets converted to an
embedding, relevant chunks are retrieved
and reranked and those chunks become
context. Your prompt template gets
filled with that context, examples, and
instructions. The appropriate model
based on your model selection logic
generates a response potentially with
streaming. The response gets validated,
post-processed if needed, and returned
to the user. Everything gets logged. The
query, retrieved chunks, model used, latency, cost, any errors, all that
stuff. Your monitoring system analyzes
these logs regularly. You review
failures, categorize error types, and
use that data to improve your prompts,
RAG configurations, or agent logic, and
the cycle continues. Your system gets
better and better over time based on
real usage data. But remember, you don't
have to build all of this at once. Start
simple. Build a basic version with one
prompt and basic RAG. Get that working.
Then add complexity. Try different
chunking strategies. Add an agent.
Implement monitoring or optimize your
model selection. Measure the impact of
each change. Don't just add features
because they sound cool. Add them
because they solve a specific problem or
improve a specific metric. Projects like
this can take weeks or months, and
that's totally normal. I'd much rather
see one really solid project than five half-finished, kind of shitty ones. It's
definitely a quality over quantity
situation. If you want to learn more
about anything we discussed today, some
recommendations that I have for learning
would be the book AI Engineering by Chip Huyen, or I have many different videos
covering these topics, which I'll link
in the description. So, that wraps up
our project template. Like I said, I
have lots of videos that can help break
down each of these components, or check
out my AI agents course up next. Thank
you so much for watching, and I'll see
you next time.
Check out SerpAPI and start building production-ready projects (agents!!!) today: https://serpapi.link/marina-wyss
AI Engineering book: https://amzn.to/44Nd7vK

Timestamps ⏰
00:00 How to make real AI Engineering portfolio projects
01:47 Problem framing and success metrics
03:41 Professional prompt engineering
04:57 Model selection and evaluation
07:15 RAG
10:41 Agents
12:09 Deployment and UI
13:09 Monitoring
14:32 Fine-tuning
16:12 The complete system
17:37 Recommended learning resources

----------------------------------------

Want to become an AI Engineer? Download my AI Engineering Skills Checklist here: https://www.gratitudedriven.com/c/ai-engineering-checklist

💬 Want to talk 1:1? Book time to chat with me here: https://topmate.io/marina_wyss

💀 Follow my second channel for more on mindset, productivity, and meaning: https://www.youtube.com/@GratitudeDriven

🩷 Join the channel membership community for priority comment replies and early access to videos! https://www.youtube.com/@MarinaWyssAI/join

☕ If you'd like to support my work, you can buy me a coffee (thank you!): https://ko-fi.com/marinawyss

----------------------------------------

🎥 Other videos you might like:
AI Agents in 38 Minutes - Complete Course from Beginner to Pro: https://www.youtube.com/watch?v=sNvuH-iTi4c&t=6s
Large Language Model Selection Masterclass - Nov 2025: https://www.youtube.com/watch?v=n4NokjyAklg&t=795s
10 Papers Every Future AI Engineer Must Read: https://www.youtube.com/watch?v=v8WAV_y5iIQ&t=1s
AI Engineering: A *Realistic* Roadmap for Beginners: https://www.youtube.com/watch?v=dbUIjFXIpis

----------------------------------------

🦫 About me
I am a Senior Applied Scientist (basically, a blend of Data Scientist/Machine Learning Engineer) at Twitch/Amazon. Outside of my full-time job I'm a 1:1 career coach for people looking to break into the field, with a focus on those from non-traditional backgrounds. I'm also a Certified Personal Trainer, always busy with too many interests, and really, deeply happy with my life. I hope to be able to help others achieve these things, too.

----------------------------------------

✉️ Contact
Instagram: https://www.instagram.com/marina.wyss/
Twitter/X: https://x.com/iammarinawyss
TikTok: https://www.tiktok.com/@gratitudedriven
Leave me a comment here on YouTube!
Business email: business@gratitudedriven.com

----------------------------------------

⚖️ Disclaimer
The views and opinions expressed in this video are my own and do not reflect the official policy or position of Twitch/Amazon or any other company I have worked for. All advice and insights shared here are based on my personal experiences and should be considered as such.

Thank you to SerpAPI for sponsoring this video!

This description may contain affiliate links. If you make a purchase I may make a small commission at no cost to you.

#AI #aiengineering #machinelearning