[Music]
A critical component in the path to building a robust AI serving platform is selecting which model best serves the use case it will be applied to. The AI researcher goes to the Google Cloud documentation page and looks through the best practice guidance. The researcher also notices that Google Cloud recently launched an Inference Quickstart service with a robust API. The researcher now sees that a Google Colab notebook is also offered for the API.
After setup and authentication to the Google Cloud project, the researcher is ready to begin.
Here, a function builds a table comparing Google-benchmarked models to their ELO ratings from Chatbot Arena, a benchmark platform for LLMs that features anonymous, randomized battles. The Colab notebook is also tied to the Billing API, so the researcher can select the correct pricing model for their Google Cloud usage, from one- and three-year committed use discounts to on-demand and Spot.
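The notebook's own code isn't shown on screen, but a comparison table like this could be assembled with pandas. Everything below (the model names, ELO values, costs, and column names) is an illustrative placeholder rather than actual Inference Quickstart output:

```python
# A minimal sketch (not the actual notebook code) of building a comparison
# table of benchmarked models joined with Chatbot Arena ELO ratings.
import pandas as pd

# Hypothetical subset of Google-benchmarked models (placeholder values).
benchmarks = pd.DataFrame({
    "model": ["meta-llama/Meta-Llama-3-8B", "openai/gpt-oss-20b"],
    "accelerator": ["H100 80GB", "H100 80GB"],
    "cost_per_m_output_tokens_usd": [0.30, 0.45],
})

# Hypothetical Chatbot Arena ELO ratings pulled from a separate source.
arena_elo = pd.DataFrame({
    "model": ["meta-llama/Meta-Llama-3-8B", "openai/gpt-oss-20b"],
    "arena_elo": [1150, 1210],
})

# Join benchmark data with ELO ratings into a single comparison table.
comparison = benchmarks.merge(arena_elo, on="model")
print(comparison.sort_values("arena_elo", ascending=False))
```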
Now the researcher sees a chart that compares the different open source models, plotting the relative intelligence of each model against a common metric used in AI serving platforms to understand the cost to serve it: price per million output tokens. This chart is a great way to narrow down the choices available and see which model can serve the most use cases at the least cost. Here we can see that while GPT-OSS 20B is very low in cost and has a good ELO rating, the researcher decides to go with a smaller model and looks at Meta Llama 3 8B.
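As a rough illustration of the chart described here (not the notebook's actual plotting code, and with made-up data points), a scatter of ELO against price per million output tokens might look like this:

```python
# Illustrative intelligence-vs-cost chart; all data points are placeholders.
import matplotlib.pyplot as plt

models = ["Llama 3 8B", "GPT-OSS 20B", "Gemma 2 27B"]   # illustrative model set
arena_elo = [1150, 1210, 1220]                           # placeholder ELO ratings
cost_per_m_output = [0.30, 0.45, 0.90]                   # placeholder $/1M output tokens

fig, ax = plt.subplots()
ax.scatter(cost_per_m_output, arena_elo)
for name, x, y in zip(models, cost_per_m_output, arena_elo):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("Price per 1M output tokens (USD)")
ax.set_ylabel("Chatbot Arena ELO")
ax.set_title("Relative intelligence vs. cost to serve (illustrative)")
plt.show()
```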
Now some technical specifics on how the model will be served, such as the model server platform and version (here, vLLM version 0.7.2), are added as variables for the rest of the notebook.
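A minimal sketch of what those notebook variables could look like; the variable names and the model identifier are assumptions, and only the vLLM version comes from the narration:

```python
# Illustrative serving variables reused throughout the rest of the notebook.
MODEL = "meta-llama/Meta-Llama-3-8B"   # assumed identifier for Llama 3 8B
MODEL_SERVER = "vllm"                  # model server platform
MODEL_SERVER_VERSION = "0.7.2"         # version mentioned in the walkthrough
```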
The researcher is presented with a way to simulate different model and performance goals to visualize the data that was benchmarked by Google. The pricing model, performance metrics such as normalized time per output token (NTPOT) and time to first token (TTFT), and cost metrics such as cost per million input or output tokens can all be used to narrow down the model, accelerator, and machine type.
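A hedged sketch of how such a filter might be expressed in pandas; the column names, machine types, and target values are placeholders, not the Inference Quickstart schema:

```python
# Narrowing down accelerator/machine-type options by performance and cost
# targets. All columns and values below are illustrative placeholders.
import pandas as pd

profiles = pd.DataFrame({
    "machine_type": ["a3-highgpu-1g", "g2-standard-12", "a2-highgpu-1g"],
    "ttft_ms": [180, 420, 260],      # time to first token
    "ntpot_ms": [22, 55, 34],        # normalized time per output token
    "cost_per_m_output_tokens_usd": [0.32, 0.28, 0.41],
})

# Hypothetical performance goals chosen by the researcher.
TTFT_TARGET_MS = 300
NTPOT_TARGET_MS = 40

candidates = profiles[
    (profiles["ttft_ms"] <= TTFT_TARGET_MS)
    & (profiles["ntpot_ms"] <= NTPOT_TARGET_MS)
].sort_values("cost_per_m_output_tokens_usd")
print(candidates)
```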
Now the notebook can populate different charts examining any dimension of the data. Here, as an example, the first chart shows throughput versus latency of the selected model across the benchmarked machine types, which on Google Cloud represent specific NVIDIA GPUs. The researcher easily sees that the a3-highgpu-1g instance has the best performance profile.
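A rough matplotlib sketch of a throughput-versus-latency chart grouped by machine type, with invented numbers standing in for the real benchmark data:

```python
# Illustrative throughput-vs-latency curves per machine type (invented data).
import matplotlib.pyplot as plt

# Hypothetical benchmark sweeps: (latency ms, throughput tokens/s) per machine type.
sweeps = {
    "a3-highgpu-1g": ([50, 100, 200, 400], [900, 1600, 2400, 2900]),
    "a2-highgpu-1g": ([80, 160, 320, 640], [500, 900, 1300, 1500]),
}

fig, ax = plt.subplots()
for machine_type, (latency, throughput) in sweeps.items():
    ax.plot(latency, throughput, marker="o", label=machine_type)

ax.set_xlabel("Latency per request (ms)")
ax.set_ylabel("Throughput (output tokens/s)")
ax.set_title("Throughput vs. latency by machine type (illustrative)")
ax.legend()
plt.show()
```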
This is just the start of the different ways the researcher can identify what they need to determine the correct model to achieve performance and/or cost goals. Since the notebook is Python, the researcher can also tap into their own key datasets to assist in their final determination, as in the sketch below. This allows the researcher to give guided, data-driven recommendations to their platform team to create the infrastructure needed to meet the demands of the AI service.
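For example (with purely hypothetical data and names), an internal traffic estimate could be joined with a benchmarked cost figure to project monthly spend:

```python
# Hypothetical example of combining an internal dataset (expected traffic per
# use case) with a benchmarked cost figure to estimate monthly serving cost.
import pandas as pd

internal_traffic = pd.DataFrame({
    "use_case": ["support-chat", "doc-summarization"],
    "monthly_output_tokens_m": [120, 45],    # millions of output tokens (placeholder)
})

cost_per_m_output_tokens_usd = 0.32          # placeholder benchmark-derived figure
internal_traffic["estimated_monthly_cost_usd"] = (
    internal_traffic["monthly_output_tokens_m"] * cost_per_m_output_tokens_usd
)
print(internal_traffic)
```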
Now the platform engineer can take the researcher's findings and, by using the GKE MCP server in the Gemini CLI, verify them and interact with other platform development tools through a natural-language interface.
Here we see the engineer verifying the researcher's findings in the CLI, directly against the same Inference Quickstart API the researcher was using in their Colab notebook. Through this natural-language interface, the platform engineer gets back results in an easy-to-read, easy-to-understand format.
The data serves as context for the engineer's next prompt: build a Kubernetes manifest for the chosen model, model server, server version, and accelerator. The Gemini CLI, driven by the powerful Gemini family of models, knows how to use the GKE MCP server to call the Inference Quickstart API and get the manifest prescribed by the service.
The GKE MCP server will determine the best Kubernetes manifest for a deployment based on the benchmarked profile. Through the rich interactive interface of the Gemini CLI, the engineer can now ask Gemini to save the manifest file to local storage and continue to do other operations on the file before committing it to their production CI/CD or infrastructure-as-code platform. Let us take a closer look at the manifest provided to see exactly how the Inference Quickstart API is giving the platform engineer a head start in serving up this inference platform.
Here we see that a PodMonitoring specification is provided for vLLM to output metrics to Google Cloud Managed Service for Prometheus. A horizontal pod autoscaler was also created targeting the vLLM GPU cache usage percentage metric, and the deployment itself carries the key vLLM server arguments used during the benchmarking process to tune the model server correctly. This essentially gives the project a boost, saving post-deployment tuning work for the AI engineers and platform teams, which is realized as a faster time to market for a critical project for their organization.
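For readers who want a feel for those objects, here is a simplified, hand-written Python sketch that renders similar manifests as YAML. The real manifests come from the Inference Quickstart API; the names, ports, thresholds, metric wiring, and vLLM flag values below are assumptions for illustration only:

```python
# Simplified sketch of a PodMonitoring spec and an HPA of the kind described
# above, built as Python dicts and dumped to YAML. Not the API's output.
import yaml

pod_monitoring = {
    "apiVersion": "monitoring.googleapis.com/v1",
    "kind": "PodMonitoring",
    "metadata": {"name": "vllm-llama3-8b-metrics"},      # hypothetical name
    "spec": {
        "selector": {"matchLabels": {"app": "vllm-llama3-8b"}},
        "endpoints": [{"port": 8000, "path": "/metrics", "interval": "15s"}],
    },
}

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "vllm-llama3-8b"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "vllm-llama3-8b",
        },
        "minReplicas": 1,
        "maxReplicas": 4,
        # Targets the vLLM GPU cache usage metric; the exact metric name and
        # adapter wiring depend on the Managed Prometheus setup.
        "metrics": [{
            "type": "Pods",
            "pods": {
                "metric": {"name": "vllm:gpu_cache_usage_perc"},
                "target": {"type": "AverageValue", "averageValue": "0.9"},
            },
        }],
    },
}

# vLLM server arguments of the sort referenced above (values are guesses).
vllm_args = [
    "--model=meta-llama/Meta-Llama-3-8B",
    "--max-model-len=8192",
    "--gpu-memory-utilization=0.95",
    "--max-num-seqs=256",
]

print(yaml.safe_dump_all([pod_monitoring, hpa]))
print("\n".join(vllm_args))
```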
GKE Inference Quickstart is the starting point on your fast path to production AI serving on Google Kubernetes Engine and Google Cloud. Thank you, and we invite you to learn more with these resources.
[Music]
Gemini CLI → https://goo.gle/4nIRBQ4
GKE AI Labs → https://goo.gle/4hmOHhT
AI/ML orchestration on GKE → https://goo.gle/3KJI38Y
Analyze model serving performance and costs with GKE Inference Quickstart → https://goo.gle/3J4DA02
Speaker: Eddie Villalba
Products Mentioned: Google Kubernetes Engine, Inference Quickstart, Google Cloud, GKE