One of the most disappointing
realizations for beginner AI engineers
is that the projects they've done while
learning won't actually help them land a
job. These standard chatbot tutorials
and RAG follow-alongs simply aren't
complex enough or close enough to what
we do in industry to show employers that
you have the skills they need. The
problem is as a beginner, you've never
seen a production AI system. You don't
know what you need to build. So you do
the next best thing and follow example
projects you find online. And now you
have the exact same ChatGPT wrapper as
everyone else. Instead, what you need is
a framework for how to build real
end-to-end AI applications that are unique
and demonstrate actual engineering
skills. In my years of sitting on hiring
panels at companies like Amazon and in
my career coaching work, these kinds of
robust, well-engineered projects are the
ones that actually make a difference in
your career. Today, we're breaking down
that framework. I'm not sharing example
projects to follow, but instead a
step-by-step guide on everything you
need to think about and include to build
something that is legit. Let's get
started. Here are the eight core
components we're going to discuss.
Problem framing and success metrics.
What are we actually solving and how do
we know if we've solved it? Prompt
engineering and systematic tracking. How
to version, evaluate, and improve
prompts like actual engineering work.
Model selection and evaluation. A
systematic way to compare different
models and make informed decisions. RAG:
building a retrieval system with smart
chunking, ranking, and evaluation. Agent
systems with a solid understanding of
design, security, and error handling.
System monitoring and error analysis.
Tracking performance at every level and
actually learning from failures.
Deployment and user interface, making
your AI system accessible and reliable.
And as a bonus, fine-tuning. This is
less common in industry, but it's still
a super valuable skill to demonstrate.
At the end, I'll share some learning
resources that can help you dig into
details on everything we talked about
today. All right, let's get into it.
Starting with the thing most people
completely skip, problem framing and
success metrics. I know that sounds
really basic. Obviously, you need to
know what problem you're solving, right?
But you'd be surprised at how many
projects completely skip this step. It's
easy to get excited about a cool new
tool and try to find some way to use it,
which is fine as long as you can solve
something important along the way. But
really, this is the opposite of how we
approach projects in industry. In
reality, we start with a business
problem and then we figure out if an LLM
is even the right tool to solve it. And
if it is, we define very clearly what
solved means. Once you've decided an LLM
is the right approach, you need to think
about the constraints that are specific
to AI engineering. This might include
things like latency and cost constraints
as well as a certain quality bar you
need to reach. By this I mean what does
good enough actually mean for your use
case. If you're building a medical Q&A
system, you need extremely high accuracy
and you probably need citations. If
you're building a creative writing
assistant, users might be fine with
outputs that are 80% useful and they'll
edit the rest. Measuring quality can be
tricky, though. Defining success metrics
for AI systems is different from
traditional ML because you're often
dealing with more subjective outputs.
You need metrics at multiple levels.
User-facing metrics answer, "Is this
actually useful?" These metrics might
include task completion rate or user
satisfaction ratings. Technical metrics
answer, "Is the AI performing well?" For
these, you might measure response
quality using an LLM as judge, human
ratings, or accuracy on test cases. And
system metrics answer, "Is this sustainable?" These include latency (how fast are the responses?), cost per request (can we afford to run this at scale?), error rate (how often does the system fail?), and uptime (is the service actually available to users?). The key is
that these metrics need to be measurable
and you need to track them over time.
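To make that concrete, here's a minimal sketch of how you might pin those targets down in code so your evaluation scripts can check against them. The metric names and thresholds here are placeholder assumptions, not recommendations for any particular project.

```python
# Hypothetical success criteria for a project -- the names and numbers are just examples.
SUCCESS_METRICS = {
    "user_facing": {"task_completion_rate": 0.80, "avg_satisfaction": 4.0},   # satisfaction out of 5
    "technical":   {"llm_judge_score": 0.75, "test_case_accuracy": 0.85},
    "system":      {"p95_latency_s": 3.0, "cost_per_request_usd": 0.02,
                    "error_rate": 0.01, "uptime": 0.995},
}

def check_targets(measured: dict, targets: dict = SUCCESS_METRICS) -> list[str]:
    """Return the metrics that miss their targets (lower-is-better for latency, cost, error rate)."""
    lower_is_better = {"p95_latency_s", "cost_per_request_usd", "error_rate"}
    misses = []
    for group, metrics in targets.items():
        for name, target in metrics.items():
            value = measured.get(name)
            if value is None:
                continue
            ok = value <= target if name in lower_is_better else value >= target
            if not ok:
                misses.append(f"{group}/{name}: {value} vs target {target}")
    return misses
```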
As you work on scoping your project, make
sure you write down the decisions you
made and why. This can be as simple as
adding it to a readme, but you'll
definitely want to document it so you're
prepared in interviews. All right, now
that we know what problem we're solving
and how we're going to measure if we've
solved it, let's talk about a component
that separates amateurs from
professionals. Prompt engineering and
systematic tracking. One thing that's
not clear from most explanations of
prompt engineering is that it's less of an art and a lot more of a systematic process. Here are some
simple things you can do to improve the
quality a ton and have your project
stand out. First, treat your prompts
like separate components of your system.
Don't hard-code them directly in your application files; instead, keep a separate file for each version of your prompt. Something like this:
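For example, a simple layout might look like the sketch below. The directory name and the loader helper are just one possible convention, not a standard.

```python
# prompts/
#   summarizer_v1.txt
#   summarizer_v2.txt
#   summarizer_v3_few_shot.txt
from pathlib import Path

PROMPT_DIR = Path("prompts")

def load_prompt(name: str, version: str) -> str:
    """Load a specific prompt version, e.g. load_prompt('summarizer', 'v2')."""
    return (PROMPT_DIR / f"{name}_{version}.txt").read_text()

# Keep the active version in config, not scattered through your code,
# so you can swap versions and compare them easily.
ACTIVE_PROMPT_VERSION = "v2"
system_prompt = load_prompt("summarizer", ACTIVE_PROMPT_VERSION)
```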
Next, build a structured evaluation framework. Create a set of
test inputs with the expected outputs or
at least criteria for what a good output
looks like. A good test set should cover
different types of questions, difficulty
levels, and edge cases, and it should be
updated regularly based on new failure
patterns you discover as you work.
Ideally, you'd want 100 to 300 examples, but
start with what you can get. Then, when
you change your prompt, run it against
the test set to measure performance. You
can use metrics like BLEU and ROUGE, exact matching for classification tasks, or LLM-as-judge, which is where you use a strong model like GPT-5 to evaluate the quality using a consistent rubric.
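Here's a rough sketch of what that evaluation loop might look like. The test-set format and the `run_prompt` helper are assumptions; `run_prompt` stands in for whatever function actually calls your model.

```python
import json
from pathlib import Path

def run_prompt(prompt_version: str, test_input: str) -> str:
    """Placeholder: call your model with the chosen prompt version and return its output."""
    raise NotImplementedError

def evaluate_prompt(prompt_version: str, test_set_path: str = "tests/prompt_cases.json") -> float:
    """Run one prompt version over the whole test set and return the pass rate."""
    cases = json.loads(Path(test_set_path).read_text())  # [{"input": ..., "expected": ...}, ...]
    passed = 0
    for case in cases:
        output = run_prompt(prompt_version, case["input"])
        # Exact match works for classification; swap in an LLM-as-judge call for open-ended outputs.
        if output.strip().lower() == case["expected"].strip().lower():
            passed += 1
    return passed / len(cases)

# Compare versions on the same test set before promoting a new prompt:
# for v in ["v1", "v2", "v3_few_shot"]:
#     print(v, evaluate_prompt(v))
```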
For a simple project, you can do your prompt
evaluation more manually. But there are
also tools like PromptLayer, Langfuse, and Weights & Biases that can
help you with prompt versioning and
evaluation tracking. All right, so now
you have a way to do systematic prompt
engineering. But which model should you
actually use for your project? That's
what we'll need to figure out next.
Model selection and evaluation. A common
beginner mistake I see is to just
default to using the newest model for
everything because it's clearly the best
model, right? But is it actually best
for your specific use case? Is it worth
the cost? Is it fast enough? Would a
smaller, cheaper model work just as
well? You don't know unless you actually
test it. In industry, we're constantly
evaluating different models because cost
and performance matter. A model that's
twice as expensive but only 5% better
might not be worth it. I have an entire
video on how to do model selection in
detail, but at a minimum, you'll want to
run a few models from different
providers and different sizes on your
test set and then compare both
performance and cost using some
consistent rubric and metrics. As usual,
make sure to document your process in
your readme. It might be that you find
one model is the right balance between
cost and performance. Or maybe you need
a combination. Model routing is a more
advanced method that works well in
practice. This is when you use a cheap,
fast model to classify the difficulty of
the incoming query, then route it to the
appropriate model from there. Simple
queries can go to a cheap and fast model
and complex queries can go to a more
expensive, higher-quality one.
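As a sketch of the idea, assuming a placeholder `call_model` helper and made-up model names:

```python
def call_model(model: str, prompt: str) -> str:
    """Placeholder: send the prompt to the named model via your provider's SDK."""
    raise NotImplementedError

def route_query(query: str) -> str:
    """Classify query difficulty with a cheap model, then route to an appropriate model."""
    label = call_model(
        "cheap-fast-model",  # placeholder model name
        f"Classify the difficulty of this query as SIMPLE or COMPLEX. Reply with one word.\n\n{query}",
    ).strip().upper()
    target = "large-expensive-model" if label == "COMPLEX" else "cheap-fast-model"
    return call_model(target, query)
```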
All right, so now you know how to systematically
choose models. Next up is one of the
most important components of modern AI
engineering systems: RAG. But before we
get into Rag, I want to talk about
something that's essential for building
real AI systems, especially when we get
into agents in a few minutes. Access to
live data from the web. Here's the
thing. If you're building an agent or
any application that needs current
information, you need a way to access
real-time data from the web. And web
scraping is a pain in the butt. You have
to deal with CAPTCHAs, rate limits,
changing HTML layouts, proxies, lots of
annoying stuff that's a distraction from
your project work. SerpAPI handles all
of that for you. They provide clean,
structured search results from Google,
YouTube, Bing, and more through a simple
API. You make one API call and get back
a clean JSON object with exactly the
data you need. This is super helpful for
AI engineering projects. Want to build
an agent that can search for current
information? Use SerpAPI's Google Search API. Training a model and need a dataset of classified images? Their Google Images API gives you pre-classified image titles, URLs, and thumbnails. Building a research assistant? Their Google Scholar API
gives you access to peer-reviewed
articles with full metadata. We're
actually going to talk about agents in a
few minutes, and many production agent
systems use SerpAPI as one of their core
tools. You can get started with 250 free
credits by clicking the link in the
description or scanning the QR code on
screen. Thank you so much to SerpAPI for
sponsoring this video. All right, now
let's talk about RAG. RAG stands for
retrieval augmented generation, which is
when we connect an LLM to a data source
like our company database. The choices
you make about chunking, embeddings, and
retrieval strategies have a massive
impact on your system's performance. And
showing that you actually thought about
these decisions is what separates your
project from one of the dozens of RAG
tutorials out there. Let me break down
what you need to focus on. Chunking is
when we break up documents into chunks.
This is one of the most important
decisions in your RAG pipeline. There's
no universal right answer on how to do
this. You can try things like fixed size
chunking at different sizes, semantic
chunking where we split by meaning, or
many other options. Test different
chunking strategies on your specific use
case. For each strategy, measure
retrieval accuracy. Again, we're going
to create a test set this time of 20 to
30 questions where you know which
sections of your documents contain the
right answer. Then you measure what
percentage of the time your retrieval
system actually returns those sections.
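Here's a minimal sketch of that kind of experiment, assuming a placeholder `retrieve` function that returns chunk IDs for a query:

```python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking by characters; try several sizes and compare."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def retrieval_hit_rate(questions: list[dict], retrieve) -> float:
    """questions: [{"question": ..., "relevant_chunk_ids": [...]}]; retrieve(query, k) -> list of chunk IDs."""
    hits = 0
    for q in questions:
        retrieved = set(retrieve(q["question"], k=5))
        if retrieved & set(q["relevant_chunk_ids"]):
            hits += 1
    return hits / len(questions)
```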
The next decision is what embedding
model to use and what vector database to
store the embeddings in. Embeddings are
numerical representations of your
documents. To transform text into
embeddings, we use a pre-trained
embedding model most of the time. OpenAI's text embedding models are solid
and really easy to use. Open source
alternatives like sentence transformers
let you run models locally for free.
Again, try different ones and test them
out. Once you've made these embedding
vectors, you need to store them
somewhere. For vector storage, you can
start working with cloud platforms right
away, which I definitely recommend, and try out AWS OpenSearch or something
similar on GCP or Azure. Once you've
created your embeddings and stored them
in a vector store, you need some way for
the system to find the relevant
embeddings. This is the search step. The
simplest approach is semantic similarity
search. You convert the user's question
into an embedding as well. Then you find
the chunks with the most similar
embeddings. That's your baseline.
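A minimal baseline might look something like this sketch, using the open-source sentence-transformers library; the specific model name is just an example:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # open-source embedding models, run locally

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model; swap in whatever you're evaluating

chunks = ["...your document chunks..."]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

def search(query: str, k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query embedding."""
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q_vec            # cosine similarity, since the vectors are normalized
    top_idx = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_idx]
```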
Hybrid search combines keyword search with
semantic search. Maybe a chunk isn't
semantically similar to the query
embedding, but contains the exact
technical term the user mentioned. In
that case, hybrid search would catch
that. Most modern vector databases
support this right out of the box. Once
you have your search strategy, there are
a couple more things you can do to
improve your results. Reranking is a
two-stage approach. First, you retrieve
maybe 20 chunks using a fast vector
search. Then, you use a more accurate
but slower model to rerank those 20
chunks and pick the best five. This is
common in production systems because it
balances speed and accuracy.
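Here's a rough sketch of that two-stage pattern, again using sentence-transformers; the cross-encoder model name is just an example:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example reranking model

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Stage 2: score each (query, chunk) pair with a slower, more accurate model and keep the best."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]

# Stage 1 (fast vector search) retrieves ~20 candidates, e.g. with the search() baseline above,
# then rerank(query, candidates) picks the best five to put into the prompt.
```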
Query expansion means you don't just search
what the user's exact question was. You
might use an LLM to rephrase the
question to be a bit more specific,
generate multiple variations, or extract
key entities and concepts. For example,
if a user asks, "What's its population?"
after previously asking about Paris, the
query might be expanded to say, "What's
the population of Paris?" You don't need
to implement all of these for a
portfolio project, but testing at least
one or two and documenting the
improvement shows that you understand
retrieval is more than just basic
similarity search. Once you've put
together all these components, it's time
to evaluate your RAG pipeline. We want
to not only evaluate the quality of our
entire system, but also each component.
Here are some metrics to consider.
Retrieval accuracy. What percentage of
the time do you retrieve the correct
chunks? Use metrics like precision at K
or recall at K. Answer accuracy given
good chunks. Does the LLM produce the
correct answer when it receives the
right context? This isolates whether
your LLM is the problem or your
retrieval is the problem. End-to-end
accuracy. Does the whole system work? By
measuring separately, you can diagnose
exactly where failures happen and
hopefully fix them.
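The retrieval metrics are easy to compute yourself. A minimal sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the truly relevant chunks that appear in the top-k retrieved results."""
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved[:k]
    return sum(1 for c in top_k if c in relevant) / k
```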
All right, so now you have a solid RAG system. But what if you
need your AI to actually do things, not
just answer questions? That's where
agents come in. An agent is basically an
LLM that can use tools and take actions
autonomously to accomplish a goal.
Instead of just answering a question, it
might search the web, run code, query a
database, call an API, and synthesize
the results. For a portfolio project,
including an agent component shows you
can build complex autonomous systems. I
have a really comprehensive course on
agent systems that goes into all the
details. I'll add a link to that. But
for now, we'll just cover this at a high
level. For your project, you can use a
framework like LangGraph or CrewAI, or
you can build your own agent system
using function calling features built
into OpenAI's or Anthropic's APIs. The
important thing is that you understand
what's happening under the hood, not
just that you copied example code. And
you've thought about the things that
separate toy projects from production
systems. Things like error handling: LLMs make mistakes. Your agent needs to
handle this gracefully. Also security.
If your agent can execute code or call
APIs, you have security concerns.
Monitoring agent behavior. Log
everything your agent does, what tools
it called, what the outputs were, and
what decisions were made at each step,
how long each step took, and where it
succeeded or failed. This lets you debug
when things go wrong and improve your
agent over time.
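Here's a minimal sketch of an agent loop that ties those concerns together: error handling, a hard step cap, and per-step logging. The `call_llm` helper and the tool implementations are placeholders, not a real framework's API.

```python
import json, time, logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent")

TOOLS = {  # each tool is a plain function; a real system also validates arguments and permissions
    "web_search": lambda query: f"results for {query}",   # placeholder implementation
    "run_sql": lambda sql: "query rejected",               # e.g. guard against destructive statements
}

def call_llm(messages: list[dict]) -> dict:
    """Placeholder: ask the model for the next action, e.g. {'tool': 'web_search', 'args': {...}} or {'final': ...}."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": task}]
    for step in range(max_steps):
        start = time.time()
        action = call_llm(messages)
        if "final" in action:
            log.info("finished in %d steps", step + 1)
            return action["final"]
        tool, args = action.get("tool"), action.get("args", {})
        try:
            result = TOOLS[tool](**args) if tool in TOOLS else f"unknown tool: {tool}"
        except Exception as exc:            # tools fail; feed the error back instead of crashing
            result = f"tool error: {exc}"
        log.info("step=%d tool=%s latency=%.2fs", step + 1, tool, time.time() - start)
        messages.append({"role": "assistant", "content": json.dumps(action)})
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "Stopped: exceeded max steps"    # hard cap so a confused agent can't loop forever
```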
As usual, we need to test this component of our system.
Agents are hard to test because they're
non-deterministic and multi-step. At a
minimum, you need unit tests for
individual tools, integration tests for
complete workflows, and some adversarial
testing to see what happens if a user
asks for something malicious. Create a
test set with at least 10 to 15
representative tasks from simple to
complex. Measure task completion rate
and average steps to completion. All
right, so you've built all these cool
components. You have prompts, RAG, and
maybe an agent. But if users can't
actually interact with your system, it's
not really a complete project.
Deployment and UI are where you make
your AI system something that's actually
useful. The most professional approach
to deployment is building a REST API
that serves your AI system. FastAPI is the standard choice for Python. It's
fast, it has automatic documentation,
and it handles async operations well,
which is important for LLM calls that
can take several seconds. The key things
to handle are streaming responses so
your UI feels faster, error handling,
because again, LLM APIs fail sometimes,
rate limiting to prevent abuse, and
authentication with at least a simple
API key.
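A minimal FastAPI sketch with an API-key check and basic error handling might look like this; `generate_answer` is a placeholder for your actual pipeline:

```python
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
API_KEY = "change-me"  # in practice, load this from an environment variable

class Query(BaseModel):
    question: str

def generate_answer(question: str) -> str:
    """Placeholder: call your RAG pipeline / agent and return the answer."""
    raise NotImplementedError

@app.post("/ask")
async def ask(query: Query, x_api_key: str | None = Header(default=None)):
    if x_api_key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
    try:
        return {"answer": generate_answer(query.question)}
    except Exception:
        # LLM APIs fail sometimes; return a clean error instead of a stack trace
        raise HTTPException(status_code=502, detail="Upstream model error")
```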
Then you need to put it somewhere people can actually use it, which usually means hosting your FastAPI app on something like AWS or GCP so
it has a stable public URL. Once it's
live, you could create a UI with
Streamlit or Gradio for a really simple setup, or a React or Next.js app if you want
it to look a little bit more like a
product. Okay, so let's say you've
deployed your project and people are
using it. Hooray. But now you need to
know if something breaks. That's where
monitoring comes in. Monitoring is what
separates a demo from a real system. And
for a portfolio project, having even
basic monitoring is a really strong
signal. Here's what to monitor for each
component. For prompts, track response
quality scores, format compliance,
refusal rates, and average response
length. If your refusal rate suddenly
spikes from 2% to 15%, something changed
and you need to investigate that. For
RAG, track retrieval confidence scores,
number of chunks retrieved, source
diversity, and retrieval latency. If
confidence scores are dropping over
time, your knowledge base might be
getting stale, or maybe user queries are
shifting in some way. For agents, track
task completion rate, average steps to
completion, tool success rates, error
types, and cost per task at a minimum.
There's a lot with agents. If your agent
is suddenly taking 15 steps to complete
tasks that used to take five steps, for
example, that means something's wrong.
For the overall system, track end to end
task success, user satisfaction,
latency, cost per request, error rate,
and uptime. In order to monitor all of
this, you need a logging system. This is
when we track what happened with each
request. At a minimum, log the timestamp, user query, what components were
used, things like which chunks were
retrieved, which model was used, what
prompt version, etc., the response, latency, cost, and any errors that
occurred. For a simple portfolio project, write these logs to a file or a SQLite database. For something more advanced, you can use a proper logging tool or service.
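Here's a minimal sketch of a JSON-lines logger you could call from your API handler; the field names are just one reasonable set of assumptions:

```python
import json, time, uuid
from pathlib import Path

LOG_FILE = Path("logs/requests.jsonl")
LOG_FILE.parent.mkdir(parents=True, exist_ok=True)

def log_request(query: str, response: str, *, model: str, prompt_version: str,
                retrieved_chunk_ids: list[str], latency_s: float, cost_usd: float,
                error: str | None = None) -> None:
    """Append one structured record per request so you can analyze failures and trends later."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "model": model,
        "prompt_version": prompt_version,
        "retrieved_chunk_ids": retrieved_chunk_ids,
        "response": response,
        "latency_s": latency_s,
        "cost_usd": cost_usd,
        "error": error,
    }
    with LOG_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")
```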
All right, so now you've got monitoring in place and a
working system. Now, let's talk about
fine-tuning, which is more advanced, but
can really improve the performance of
your project. You might be wondering why
I put fine-tuning all the way at the end
in a bonus section. The reason is
fine-tuning is actually less common in
industry than people think. Most of the
time, a well-engineered prompt gets you
90% of the way there. Fine-tuning is
only worth it when you've already
optimized everything else and you need
that little extra performance boost or
when you have very specific requirements
that prompting can't solve. Here are
some realistic use cases for fine-tuning
in your project. Consistent output
formatting. If you need your model to
always output a very specific JSON
structure and prompting isn't reliable
enough, fine-tuning can help. Matching a
larger model's performance with a
smaller model. Maybe Sonnet 4.5 gives
you great results, but it's too
expensive or slow. You could fine-tune
Haiku 3.5 on examples from Sonnet 4.5 to
get similar quality at a lower cost.
Domain specific language. If you're
working with specialized terminology
like medical, legal, or technical
documentation, fine-tuning on domain
specific examples can improve
performance. And for RAG systems, you
can fine-tune your embedding model on
domain specific data to improve
retrieval quality. The key is to only
fine-tune if you have a clear reason and
you've already tried everything else. If
you do decide to fine-tune, here's how
to do it properly. You're going to need
to create a high-quality training data
set. This is the most important part.
Your fine-tuned model is only as good as
your training data. You need at least 100 to 500 example inputs and ideal outputs.
These should be real examples from your
use case. Then establish a baseline
before you fine-tune. Measure
performance with your best prompt on a
held out test set. This is your baseline
to beat. Train multiple versions. Train
with different amounts of training data,
different numbers of epochs, different
learning rates, and track everything.
Finally, compare performance. Does the
fine-tuned model actually beat your
baseline? By how much? Is the
improvement worth the effort and cost?
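If you go the hosted fine-tuning route, the training data is typically a JSONL file of chat examples. Here's a minimal sketch of preparing one, assuming the chat-style format that OpenAI's fine-tuning API uses:

```python
import json

def to_finetune_example(user_input: str, ideal_output: str,
                        system: str = "You are a helpful assistant.") -> dict:
    """One training example: a full conversation ending in the ideal assistant response."""
    return {"messages": [
        {"role": "system", "content": system},
        {"role": "user", "content": user_input},
        {"role": "assistant", "content": ideal_output},
    ]}

def write_training_file(pairs: list[tuple[str, str]], path: str = "train.jsonl") -> None:
    """pairs: (input, ideal_output) tuples drawn from real examples in your use case."""
    with open(path, "w") as f:
        for user_input, ideal_output in pairs:
            f.write(json.dumps(to_finetune_example(user_input, ideal_output)) + "\n")
```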
All right, so we've covered all eight
components. Now, let's talk about how
they all fit together and what your
final project should look like. Let me
show you how all the pieces we've
discussed fit together. A user submits a
query through your UI. The request hits
your API and gets logged with metadata.
If you have an agent, it decides whether
it needs RAG, which tools to use, and
what actions to take. If RAG is needed,
the query gets converted to an
embedding, relevant chunks are retrieved
and reranked and those chunks become
context. Your prompt template gets
filled with that context, examples, and
instructions. The appropriate model
based on your model selection logic
generates a response potentially with
streaming. The response gets validated,
post-processed if needed, and returned
to the user. Everything gets logged. The
query, retrieved chunks, model used, latency, cost, any errors, all that
stuff. Your monitoring system analyzes
these logs regularly. You review
failures, categorize error types, and
use that data to improve your prompts,
RAG configurations, or agent logic, and
the cycle continues. Your system gets
better and better over time based on
real usage data. But remember, you don't
have to build all of this at once. Start
simple. Build a basic version with one
prompt and basic RAG. Get that working.
Then add complexity. Try different
chunking strategies. Add an agent.
Implement monitoring or optimize your
model selection. Measure the impact of
each change. Don't just add features
because they sound cool. Add them
because they solve a specific problem or
improve a specific metric. Projects like
this can take weeks or months, and
that's totally normal. I'd much rather
see one really solid project than five half-finished, kind of shitty ones. It's
definitely a quality over quantity
situation. If you want to learn more
about anything we discussed today, some
recommendations that I have for learning
would be the book AI Engineering by Chip Huyen, or I have many different videos
covering these topics, which I'll link
in the description. So, that wraps up
our project template. Like I said, I
have lots of videos that can help break
down each of these components, or check
out my AI agents course up next. Thank
you so much for watching, and I'll see
you next time.
Check out SerpAPI and start building production-ready projects (agents!!!) today: https://serpapi.link/marina-wyss
AI Engineering book: https://amzn.to/44Nd7vK

Timestamps ⏰
00:00 How to make real AI Engineering portfolio projects
01:47 Problem framing and success metrics
03:41 Professional prompt engineering
04:57 Model selection and evaluation
07:15 RAG
10:41 Agents
12:09 Deployment and UI
13:09 Monitoring
14:32 Fine-tuning
16:12 The complete system
17:37 Recommended learning resources

----------------------------------------

Want to become an AI Engineer? Download my AI Engineering Skills Checklist here: https://www.gratitudedriven.com/c/ai-engineering-checklist

💬 Want to talk 1:1? Book time to chat with me here: https://topmate.io/marina_wyss

💀 Follow my second channel for more on mindset, productivity, and meaning: https://www.youtube.com/@GratitudeDriven

🩷 Join the channel membership community for priority comment replies and early access to videos! https://www.youtube.com/@MarinaWyssAI/join

☕ If you'd like to support my work, you can buy me a coffee (thank you!): https://ko-fi.com/marinawyss

----------------------------------------

🎥 Other videos you might like:
AI Agents in 38 Minutes - Complete Course from Beginner to Pro: https://www.youtube.com/watch?v=sNvuH-iTi4c&t=6s
Large Language Model Selection Masterclass - Nov 2025: https://www.youtube.com/watch?v=n4NokjyAklg&t=795s
10 Papers Every Future AI Engineer Must Read: https://www.youtube.com/watch?v=v8WAV_y5iIQ&t=1s
AI Engineering: A *Realistic* Roadmap for Beginners: https://www.youtube.com/watch?v=dbUIjFXIpis

----------------------------------------

🦫 About me
I am a Senior Applied Scientist (basically, a blend of Data Scientist/Machine Learning Engineer) at Twitch/Amazon. Outside of my full-time job I'm a 1:1 career coach for people looking to break into the field, with a focus on those from non-traditional backgrounds. I'm also a Certified Personal Trainer, always busy with too many interests, and really, deeply happy with my life. I hope to be able to help others achieve these things, too.

----------------------------------------

✉️ Contact
Instagram: https://www.instagram.com/marina.wyss/
Twitter/X: https://x.com/iammarinawyss
TikTok: https://www.tiktok.com/@gratitudedriven
Leave me a comment here on YouTube!
Business email: business@gratitudedriven.com

----------------------------------------

⚖️ Disclaimer
The views and opinions expressed in this video are my own and do not reflect the official policy or position of Twitch/Amazon or any other company I have worked for. All advice and insights shared here are based on my personal experiences and should be considered as such.

Thank you to SerpAPI for sponsoring this video!

This description may contain affiliate links. If you make a purchase I may make a small commission at no cost to you.

#AI #aiengineering #machinelearning