Developing LLM Powered Applications: Llama 2 on Azure (5/n)

Madhusudhan Konda
9 min read · Oct 15, 2023

Microsoft has opened the floodgates by joining hands with Meta to offer Meta's open-source Large Language Model (LLM), Llama 2, on Azure! Believe it or not, this is a big deal.

Amazon's AWS released Amazon SageMaker JumpStart at the end of last year, a similar offering to Azure's that supports deploying open-source models for public and private consumption. I will write an article on developing LLM-powered applications using AWS foundation models in the coming weeks!

One can deploy a Llama 2 LLM on Azure and expose it to the world securely so it can be consumed by users or applications. This opens up a ton of commercial opportunities for organisations to integrate LLMs into software applications, thus creating "accelerated apps (Xapps)".

In this article, I will go over the details of creating a simple Xapp that integrates with an LLM: the Llama 2 7-billion-parameter model.

What is a Foundation Model?

A foundation model is a large language model (LLM) trained on humongous, publicly available datasets. Foundation models can be fine-tuned for focussed, specific use cases, thus helping us develop specialised applications. We can leverage their intelligence for various tasks such as content creation, summarization, question answering and more, and we can even fine-tune them on our own custom datasets. OpenAI's GPT-4, Google's BERT and Meta's Llama 2 all fall into this category of foundation models.

Meta released a set of open-source models under the Llama 2 umbrella: Llama 2 with 7 billion, 13 billion and 70 billion parameters. As these models are open source, they can be used for commercial purposes, opening the floodgates to re-invent and innovate software applications. Prior to the LLM era, an application may have had to deduce intelligence (if any) using hand-coded rules or rudimentary machine learning models.

Developers can leverage the pre-trained and fine-tuned models of Llama 2 to accelerate the development of more capable and intelligent applications.

Plausible Architectures

The more I explore, the more plausible architectures I find for developing and deploying LLM-integrated software applications. While I can't go into these architectures in detail here, I can surely present them for completeness:

Embedded LLM Service

This setup promotes a locally running LLM service: a micro-service deployed as part of the normal application deployment. The other micro-services in the deployment invoke the llm-service with appropriate inputs (prompts) and fetch the results (completions) from the LLM. The LLM is expected to work with its pre-trained knowledge only, meaning there is no additional feed of augmented data for its context.

In the figure shown here, the LLM will answer based on its pre-training:

The LLM is deployed and embedded as a service on the infrastructure, possibly as a micro-service, exposing its endpoints to the other micro-services. If another application from a different part of the organization wishes to use this setup, it will have to follow the same pattern all over again.

The potential cost of such an architecture lies in the local deployment of the model: infrastructure costs, request/response times, scalability and high availability, reliability of the model completions and so on.

If the applications are not mission critical, or have some tolerance for latency and LLM hallucinations, perhaps this could be the architecture to pick.
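To make the pattern concrete, here is a minimal sketch of how one micro-service might call such an embedded llm-service over HTTP. The URL, path and JSON fields (prompt, max_new_tokens, completion) are assumptions for illustration, not a fixed contract.

import requests

# Hypothetical locally deployed llm-service; the URL and JSON shape are assumptions
LLM_SERVICE_URL = "http://localhost:8080/v1/completions"

def get_completion(prompt: str) -> str:
    # Send the prompt to the embedded llm-service and return the completion text
    payload = {"prompt": prompt, "max_new_tokens": 200}
    response = requests.post(LLM_SERVICE_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json().get("completion", "")

if __name__ == "__main__":
    print(get_completion("Summarise our returns policy in two sentences."))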

Orchestrated LLM Service

An LLM is deployed to a cloud (Azure, for example) and is expected to expose the model via pre-defined RESTful endpoints. Instead of running an embedded LLM service, here we create an llm-service cluster with multiple pods running on a container orchestrator such as Kubernetes.

Any team in the organisation can get hold of the contract exposed by the llm-service and integrate it into their applications.

The llm-service is still not fed any specific dataset, but there's no reason why it can't be fed a private dataset if needed.
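As a rough sketch of the kind of contract such a shared llm-service might publish to consuming teams, the request and response shapes could look like the following. The field names and the internal DNS name are assumptions for illustration:

from typing import TypedDict

# Hypothetical request/response contract exposed by the shared llm-service
class CompletionRequest(TypedDict):
    prompt: str
    max_new_tokens: int
    temperature: float

class CompletionResponse(TypedDict):
    completion: str
    model: str        # e.g. "llama-2-7b-chat"
    latency_ms: int

# Inside the cluster, consumers POST a CompletionRequest to the service's
# internal DNS name, e.g. http://llm-service.llm.svc.cluster.local/v1/completions,
# and receive a CompletionResponse back.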

Cloud Managed LLM Service

Rather than creating a huge AKS cluster ourselves to deploy our llm-services, wouldn't it be better if cloud providers offered this capability? We can choose a model and ask the cloud provider to simply deploy it in their pre-built cluster. This way, we can alleviate the problems around securing, managing and maintaining model services in our own cluster.

The foundation models offered by cloud providers such as Azure and AWS via their ML workspaces can be deployed and exposed via RESTful endpoints. Any application, with the appropriate security keys, can invoke the cloud-based llm-service's endpoints and integrate it into its software pipeline.

The advantages are cost-effectiveness and a quicker time to market. Running GPU-based machines in a local setup may not be cost-effective when compared to running the same in the cloud.

I managed to play with both AWS SageMaker JumpStart (I am still waiting for access to deploy Llama 2 models via JumpStart) and Azure's foundation language models.

Let me show you how I managed to deploy a Llama 2 (7 billion) chat model for public consumption.

Llama 2 Service on Azure

Head over to the Azure portal and create a Machine Learning workspace if you’ve not done so (search for Machine Learning in the search bar).

Creating a Machine Learning Workspace

Follow the steps to get the ML workspace created for us as part of this creation process.

Here's the detailed documentation if you need to refer to the steps in more detail.

Once the workspace is created, you'd need to open up the Studio; a message like the following will be shown on the overview page of the ML workspace.

Launch the Azure ML Studio

Once you launch the studio, the first step is to visit the model catalogue. The model catalogue is a new addition to the ML workspace that consists of all the available foundation models, from OpenAI to Llama to Anthropic to Falcon. Remember, the model catalogue is in "preview" mode, so features can be modified and/or removed before general availability.

Click on View Models under the "Introducing Llama 2" banner to find all the Llama 2 models that Azure supports as of now:

The model catalogue with Llama 2 models amongst others

As you can see in the picture, all three model sizes (7b, 13b and 70b) are supported, in both text-generation and chat variants.

Choose the model you want to deploy; in my case, I picked the Llama-2-7b-chat model. The next step is to deploy this model to a beefed-up, GPU-based virtual machine.

We will be asked whether we want a real-time endpoint or a batch-based one. As we want to build a chat-based application, I chose the real-time endpoint. Batch endpoints are for using the LLM to carry out bulk operations, such as going over a ton of images, datasets and so on.

Azure has an inbuilt mechanism to provide a secure and safe endpoint, which prohibits the LLM from providing answers in the self-harm, sexual, violence and hate categories. I strongly advise deploying this protected endpoint when you are deploying for public use.

The model will be deployed along with other resources on a VM, usually a GPU-enabled machine. I went with the default "Standard_NC12s_v3", which has 12 virtual cores, 224GB RAM and 2 GPUs (a pretty beefed-up machine).

The cost of this VM is a whopping $6/hr for a spot instance, so you'd be looking at roughly $150 per day, or around $4,500 a month. In fact, I left my VM running overnight with an LLM endpoint, and this is the cost I incurred :)

(Un)expected bill running Llama 2 overnight :)

Anyway, coming back to our deployment: go ahead and deploy the LLM on the chosen VM, providing an endpoint name and other details. Once you submit the deployment, it takes a few minutes (over 15 minutes in my case) for the deployment to succeed and the endpoint to become ready.

There seem to be some restrictions on non-US regions when working with GPU-based VMs. I couldn't get a VM deployed in the London region, so I had to scrap my workspace and start everything from scratch in the East US 2 (eastus2) region.

Once the endpoint is ready, you are good to go! You can check the "Endpoints" section in the ML studio sidebar for its availability. For example, I can see that my endpoint was deployed successfully in the "Endpoints" section:

Llama Service’s Endpoint and Swagger URI

Note that the endpoint is accessible only with API keys; the keys are available under the Consume section of the endpoint:

API Keys to access our Llama 2 Service Endpoint

You can use either the primary or the secondary key to invoke the endpoint. We will see how to do that using Postman as well as a Python program.
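If you prefer to pull the keys programmatically instead of copying them from the portal, a minimal sketch using the Azure ML Python SDK (the azure-ai-ml package) could look like this; the subscription, resource group, workspace and endpoint names are placeholders you'd fill in:

from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# Placeholders: fill in your own subscription, resource group and workspace
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Fetch the primary/secondary keys of the deployed online endpoint
keys = ml_client.online_endpoints.get_keys(name="<endpoint-name>")
print(keys.primary_key)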

LLM is at your service

You have an LLM (Llama-2-7b model) deployed and at your service!

Let's test this using Postman, as the endpoint is a public one. I've created a POST request and pasted in my endpoint:

Postman test invoking our Llama 2 Service deployed on Azure

We need to provide the request body in JSON format; as you can see, I've asked the LLM to answer a question about DOGE coin.
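For reference, the raw JSON body I posted looked roughly like the following; the exact schema (input_data, input_string, parameters) for your model version is published on the endpoint's Swagger URI, so treat this as a sketch:

{
  "input_data": {
    "input_string": [
      { "role": "user", "content": "What is DOGE coin?" }
    ],
    "parameters": {
      "temperature": 0.6,
      "top_p": 0.9,
      "max_new_tokens": 200
    }
  }
}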

You would need to provide the API key via the Authorization tab; make sure you use the "Bearer Token" type and paste in one of the keys from the authentication keys section.

Provide the Bearer token

All the hard work is done — the final step is to invoke the endpoint and wait for the result!

Received the LLM’s response

Yay! We managed to invoke the LLM deployed on Azure from our local Postman.

The same can be done by invoking the endpoint from a program, as the Python client below shows:

import json

import requests
import streamlit as st

# Scoring URL of the Llama 2 endpoint deployed on Azure ML
server_url = "https://llama2.eastus2.inference.ml.azure.com/score"

# API key copied from the endpoint's Consume section (redacted)
api_key = "vCWWzxvpxRUFGtGmS0U0xxxx"

# Ask the deployed LLM a question and display the answer
def ask_user(user_question):
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ' + api_key,
        'azureml-model-deployment': 'llama-2-7b-chat-12'
    }

    if user_question:
        # Call the Azure ML endpoint to get the answer;
        # user_question is the full JSON payload, as in the Postman call
        response = requests.post(
            server_url,
            data=json.dumps(user_question),
            headers=headers
        )

        answer = response.json()
        st.write(answer)

As you can see, we are invoking the endpoint directly from our Python code!
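Since the snippet already uses Streamlit's st.write to display the answer, a tiny bit of glue turns it into a minimal chat page. The payload shape below is an assumption that mirrors the Postman request:

# Minimal Streamlit glue; run with: streamlit run app.py
question = st.text_input("Ask the Llama 2 endpoint a question")

if question:
    payload = {
        "input_data": {
            "input_string": [{"role": "user", "content": question}],
            "parameters": {"temperature": 0.6, "max_new_tokens": 200}
        }
    }
    ask_user(payload)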

That's pretty much it for this article. In the upcoming article, we will look at AWS's SageMaker JumpStart, as well as fine-tuning these foundation models for specialised and custom use cases.

Wrap up

Foundation models are pre-trained models made available to us by cloud providers; our job is to get them deployed to cloud environments and obtain an endpoint so they can be invoked from our applications.

Deploying them on our own infrastructure can be a bit involved and intensive; we need to secure, manage and maintain them, which can become a bit of a hassle.

Fortunately, cloud providers such as Azure and AWS have created mechanisms to deploy and use LLMs as a PaaS (and IaaS) service. We can surely take advantage of the platform support they provide.

Stay tuned!


Madhusudhan Konda

Madhusudhan Konda is a full-stack lead engineer, mentor, and conference speaker. He delivers live online training on Elasticsearch, Elastic Stack & Spring Cloud.