
Best Practices for Azure OpenAI API Performance

Jul 1, 2025

Want faster, more reliable Azure OpenAI API performance? Here's how:

To maximize your Azure OpenAI API's efficiency and reduce costs, focus on latency, throughput, and token management. Choose the right model for your workload (e.g., GPT-3.5-turbo is faster than GPT-4), optimize token usage, and monitor key metrics like response times and error rates. Use batching, prompt optimization, and Azure API Management (APIM) to streamline operations.

Key takeaways:

  • Latency vs. Throughput: Latency is response time; throughput is tokens processed per minute.
  • Token Tips: Shorter outputs improve speed; set max_tokens as close as possible to expected output size.
  • Error Prevention: Avoid exceeding rate limits (e.g., tokens per minute) to reduce disruptions.
  • Batching: Combine multiple requests into one to save time and money.
  • Monitoring Tools: Use Azure Monitor for tracking performance, token usage, and error rates.

Efficient deployment and consistent monitoring ensure your API scales smoothly while keeping costs under control.

Azure OpenAI Service - Rate Limiting, Quotas, and throughput optimization


Key Performance Metrics You Need to Know

If you're looking to get the best performance out of your Azure OpenAI API, keeping an eye on key metrics is essential. These metrics give you a clear picture of how your system is doing and point out areas where you can make improvements.

Latency and Throughput

Latency refers to how long it takes for the model to respond. This can change depending on the model you’re using, the length of your prompt, and the system's current load. Adding content filtering for safety can also increase latency.

Throughput, on the other hand, measures how many tokens are processed per minute (TPM). Several factors influence throughput, such as the quota assigned to your deployment, the number of input and output tokens, and the rate of API calls. For non-streaming requests, you’ll want to measure the total time it takes to complete the request. For streaming requests, focus on metrics like time-to-first-token and the average token rate. By analyzing latency and throughput together, you can pinpoint performance bottlenecks more effectively.

Prompt Size and Token Limits

The size of your prompt has a direct impact on performance, so understanding token metrics is key:

  • 1 token ≈ 4 characters
  • 1 token ≈ ¾ of a word
  • 100 tokens ≈ 75 words

Interestingly, the relationship between prompt size and latency isn’t straightforward. For example, cutting your prompt size in half might only reduce latency by 1–5%. However, reducing output tokens by 50% could lower latency by as much as 50%. Azure OpenAI determines the maximum number of tokens processed per request based on several factors: the prompt text, the max_tokens setting, and the best_of setting. To optimize both performance and costs, set the max_tokens parameter as close as possible to the expected response size. This is a critical part of prompt optimization, which will be discussed further in later sections.
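To put these ratios into practice, you can measure prompt length before sending a request and size max_tokens to the expected answer. Below is a minimal sketch using the tiktoken library; the encoding name is an assumption, so check which encoding your deployed model actually uses:

```python
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens roughly the way GPT-3.5/GPT-4 class models do (assumed encoding)."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

prompt = "Summarize the quarterly sales report in three bullet points."
prompt_tokens = count_tokens(prompt)

# Size max_tokens to the expected output (~75 words is roughly 100 tokens),
# rather than leaving it at the model's maximum.
expected_output_tokens = 100
print(f"Prompt tokens: {prompt_tokens}, suggested max_tokens: {expected_output_tokens}")
```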

Error Rates and Rate Limits

Tracking metrics like response times and throughput is only useful if you're also managing error rates and staying within rate limits. Rate limits dictate how often the API can be accessed and are typically measured in terms of RPM (requests per minute), RPD (requests per day), TPM (tokens per minute), TPD (tokens per day), and IPM (images per minute).

These limits are set at the subscription and deployment level, not per user, and they vary by model. Azure OpenAI assigns a TPM quota per model, per region, and per subscription, which you can divide across your deployments. The RPM limit is derived from that quota at a ratio of 6 RPM per 1,000 TPM.

If you exceed these limits, you’ll encounter HTTP 429 errors, which can disrupt the API’s reliability and negatively impact user experience. Additionally, some model families share rate limits, meaning heavy usage of one model could affect the performance of others. To avoid these issues, use Azure Monitor to track metrics like response times, token usage, and error rates. Key metrics to monitor include Azure OpenAI Requests, Active Tokens, Generated Completion Tokens, Processed Prompt Tokens, Time to Response, and Tokens per Second. These insights can help you address potential bottlenecks before they become a problem.
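When a request does hit a limit, the service returns HTTP 429, usually with a Retry-After header telling you how long to wait. Here is a minimal sketch of honoring that header against the REST endpoint; the resource URL, deployment name, and API version are placeholders:

```python
import os
import time
import requests

ENDPOINT = "https://my-resource.openai.azure.com"   # placeholder resource
DEPLOYMENT = "gpt-35-turbo"                         # placeholder deployment name
API_VERSION = "2024-02-01"                          # example API version

url = f"{ENDPOINT}/openai/deployments/{DEPLOYMENT}/chat/completions?api-version={API_VERSION}"
headers = {"api-key": os.environ["AZURE_OPENAI_API_KEY"]}
payload = {"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 50}

for attempt in range(5):
    response = requests.post(url, headers=headers, json=payload, timeout=60)
    if response.status_code != 429:
        response.raise_for_status()
        print(response.json()["choices"][0]["message"]["content"])
        break
    # Respect the server's suggested wait before retrying; fall back to doubling.
    wait = int(response.headers.get("Retry-After", 2 ** attempt))
    time.sleep(wait)
```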

Best Practices for Better API Performance

Improving API performance isn't just about monitoring metrics; it's about implementing smart strategies that enhance both efficiency and reliability. Below are some effective methods to get the most out of your API operations.

Batching and Workload Separation

The Azure OpenAI Batch API is designed to handle high-volume tasks efficiently. By bundling requests into a single JSONL file, you can cut costs significantly - up to 50% compared to standard pricing. Each batch file can include as many as 100,000 requests and is typically processed within 24 hours or less [19, 20]. When submitting these batches, make sure each line of the JSONL file specifies the correct model (deployment) attribute.

To avoid disruptions in your online workloads, batch requests operate within their own enqueued token quota. Activating dynamic quota for global batch deployments can further reduce the risk of job failures due to token shortages.
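As a rough illustration, a batch job is a .jsonl file of independent requests that you upload and then reference when creating the job. The sketch below uses the openai Python SDK's file and batch endpoints; the endpoint URL, deployment name, API version, and file path are placeholders, and the exact field values may differ for your Azure configuration:

```python
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",  # placeholder
    api_key="...",                                          # placeholder
    api_version="2024-07-01-preview",                       # example batch-capable version
)

# Each line is one self-contained request; "model" is the batch deployment name.
tasks = [
    {
        "custom_id": f"task-{i}",
        "method": "POST",
        "url": "/chat/completions",
        "body": {
            "model": "gpt-4o-batch",  # placeholder global-batch deployment
            "messages": [{"role": "user", "content": doc}],
            "max_tokens": 200,
        },
    }
    for i, doc in enumerate(["First document...", "Second document..."])
]

with open("batch_input.jsonl", "w") as f:
    f.writelines(json.dumps(t) + "\n" for t in tasks)

# Upload the file, then create the batch job (typically processed within 24 hours).
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/chat/completions",
    completion_window="24h",
)
print(batch_job.id, batch_job.status)
```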

Resource Management Techniques

Selecting the right scaling method - horizontal, vertical, or auto-scaling - is critical for resource optimization. Incorporate retry logic with exponential backoff to handle rate limits more effectively. As Paul Singh, a Software Engineer, puts it:

"With APIM, this will allow us do this easily... using the concept of retries with exponential backoff."

To manage sudden workload spikes, increase load gradually and test configurations thoroughly. Adjust tokens-per-minute (TPM) quotas based on traffic patterns, and consolidate smaller requests into larger ones to improve overall efficiency. For production environments, Provisioned Throughput Units (PTUs) offer reserved capacity, ensuring consistent performance. Standard deployments, meanwhile, are more suitable for development and testing.

Prompt Optimization

Reducing output tokens by 50% can cut latency by a similar margin [2, 23]. For better efficiency, set the max_tokens parameter as low as your use case allows, and use stop sequences to eliminate unnecessary output. Additionally, lowering the n (number of completions) and best_of parameters minimizes processing time.
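For example, here is a hedged sketch of tightening those parameters on a single chat completion call; the endpoint, deployment name, and stop sequence are placeholders chosen for illustration:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",  # placeholder
    api_key="...",                                          # placeholder
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-35-turbo",          # placeholder deployment name
    messages=[
        {"role": "system", "content": "Answer in a single short paragraph."},
        {"role": "user", "content": "What drives latency in LLM APIs?"},
    ],
    max_tokens=120,    # sized to the expected answer, not the model maximum
    temperature=0,     # deterministic, fact-oriented output
    n=1,               # request only one completion
    stop=["\n\n"],     # cut generation at the first blank line
)
print(response.choices[0].message.content)
```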

Crafting precise prompts is equally important. Avoid vague instructions and provide clear examples to specify the desired output format. For fact-based responses, setting the temperature to 0 ensures the most reliable results. Lastly, choose models that fit your needs - GPT-3.5 models generally respond faster than GPT-4, and smaller models are often more cost-effective [2, 13].

| Prompt Size (tokens) | Generation Size (tokens) | Requests per Minute | Total TPM | PTUs Required |
| --- | --- | --- | --- | --- |
| 800 | 150 | 30 | 28,500 | 15 |
| 5,000 | 50 | 1,000 | 5,050,000 | 140 |
| 1,000 | 300 | 500 | 650,000 | 30 |

This table highlights how varying prompt and generation sizes affect resource demands, enabling you to plan deployments more effectively.

Infrastructure and Deployment Setup

Building a strong infrastructure is essential for getting the most out of your Azure OpenAI APIs. A solid setup ensures your system can handle traffic surges, maintain uptime, and scale efficiently alongside your business needs.

Using Azure API Management (APIM)


Azure API Management (APIM) acts as a gateway between your applications and Azure OpenAI services. As Microsoft explains:

"Azure API Management provides a robust platform to expose, manage, and secure APIs. It acts as a facade between your backend services and consumers".

APIM comes packed with features that boost performance. For instance, it can cache frequent responses to cut down on latency and balance traffic loads to avoid bottlenecks. It also enforces quotas, throttles requests to control traffic, and transforms requests and responses to maintain consistency. Transformation policies can help standardize incoming and outgoing data.

To get the most out of APIM, fine-tune your policy expressions and build resilient error-handling mechanisms. Keep an eye on reliability metrics to catch issues early. For added security, narrow down the supported TLS versions and offload tasks like JWT validation, IP filtering, and content checks to API policies.
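From the application side, routing traffic through APIM usually just means pointing the client at the gateway URL and passing the subscription key. A minimal sketch follows; the gateway URL, header name, and deployment name are assumptions that depend on how your APIM instance is configured:

```python
from openai import AzureOpenAI

# Point the SDK at the APIM gateway instead of the Azure OpenAI resource directly.
client = AzureOpenAI(
    azure_endpoint="https://my-gateway.azure-api.net",     # placeholder APIM gateway
    api_key="unused",                                      # auth is handled by APIM below
    api_version="2024-02-01",
    default_headers={"Ocp-Apim-Subscription-Key": "..."},  # placeholder APIM key
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # placeholder deployment exposed behind the gateway
    messages=[{"role": "user", "content": "Ping through the gateway"}],
    max_tokens=20,
)
print(response.choices[0].message.content)
```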

There are real-world examples of how APIM can drive success. Visier, for instance, created "Vee", a generative AI assistant that serves up to 150,000 users per hour. By using Provisioned Throughput Units (PTUs) with APIM, they achieved response times that were three times faster and also cut down on compute costs. Similarly, UBS launched "UBS Red", which supports 30,000 employees across different regions by implementing region-specific deployments.

By combining effective API management with resilient deployment strategies, you can ensure your services maintain high quality even during heavy traffic.

High Availability and Fault Tolerance

Disruptions - whether planned or unexpected - are inevitable, so it's critical to prepare for them. Multi-region deployments can help minimize latency and meet data residency requirements; a simple client-side failover sketch follows the list below. Azure OpenAI offers three deployment types to address different needs:

  • Global deployments: Requests are routed through Microsoft's global infrastructure. This is the most cost-effective option, with data residency maintained at rest.
  • Regional deployments: Data is processed within a specific Azure region (28 regions available), which is ideal for compliance and performance needs.
  • Data Zones deployments: Data is processed within designated geographic zones (e.g., EU or US), offering a balance between compliance and cost.
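As a rough illustration of client-side resilience across regions, the sketch below tries a primary regional endpoint and falls back to a secondary one on failure. Both endpoints, the deployment name, and the error-handling granularity are assumptions; production setups often handle this in APIM or a load balancer instead:

```python
from openai import APIConnectionError, APIError, AzureOpenAI, RateLimitError

# Placeholder endpoints for two regional deployments of the same model.
REGIONS = [
    {"endpoint": "https://my-resource-eastus.openai.azure.com", "key": "..."},
    {"endpoint": "https://my-resource-westeurope.openai.azure.com", "key": "..."},
]

def chat_with_failover(prompt: str) -> str:
    last_error = None
    for region in REGIONS:
        client = AzureOpenAI(
            azure_endpoint=region["endpoint"],
            api_key=region["key"],
            api_version="2024-02-01",
        )
        try:
            response = client.chat.completions.create(
                model="gpt-35-turbo",  # placeholder deployment name, same in both regions
                messages=[{"role": "user", "content": prompt}],
                max_tokens=100,
            )
            return response.choices[0].message.content
        except (RateLimitError, APIConnectionError, APIError) as err:
            last_error = err  # move on to the next region
    raise RuntimeError("All regions failed") from last_error

print(chat_with_failover("Status check"))
```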

Strong monitoring and alerting systems are key to fault tolerance. Enable diagnostic logs for quick troubleshooting and use Azure Monitor to gather and analyze telemetry data. Keeping track of token usage and request rates ensures you stay within your quotas, reducing the risk of service interruptions.

Resource management also plays a big role. Plan your PTU allocation based on anticipated growth, monitor usage regularly, and apply consistent tagging with Azure Policy to enforce organizational standards.

A great example of effective deployment is Ontada, a McKesson company. They used the Batch API to process over 150 million oncology documents, unlocking 70% of previously inaccessible data and cutting document processing time by 75% across 39 cancer types.

| Deployment Type | Data Residency | Cost | Best Use Case |
| --- | --- | --- | --- |
| Global | At rest only | Most cost-effective | Variable demand, cost optimization |
| Regional | At rest and processing | Higher cost | Compliance needs, specific regions |
| Data Zones | Geographic zones (EU/US) | Middle ground | Balancing compliance and cost |

Choose a deployment strategy that aligns with your requirements. By incorporating redundancy and continuous monitoring, you can ensure your APIs are both scalable and reliable.


Monitoring and Performance Tracking

Shifting from reactive problem-solving to proactive optimization starts with effective monitoring. Without a solid tracking system in place, you might overlook performance bottlenecks or miss out on opportunities to fine-tune your setup.

Using Azure Monitoring Tools

Azure Monitor serves as the go-to solution for keeping tabs on API performance. It gathers metrics and logs from your Azure OpenAI resources, offering a clear view of availability, performance, and resilience. As Microsoft explains:

"When you have critical applications and business processes that rely on Azure resources, you need to monitor and get alerts for your system".

Azure OpenAI includes built-in dashboards accessible through the AI Foundry resource view and the Azure portal. These dashboards organize key metrics into categories like HTTP Requests, Token-Based Usage, PTU Utilization, and Fine-tuning. By configuring diagnostic settings, you can send logs and metrics to a Log Analytics workspace, unlocking deeper insights with tools like Azure Monitor's Metrics explorer and Log Analytics. Setting up alerts for unusual metrics allows you to tackle potential issues before they escalate.
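For programmatic access to the same data, the azure-monitor-query SDK can pull metrics directly from your Azure OpenAI resource. Below is a minimal sketch; the resource ID and metric names are assumptions, so verify the exact names in the portal's Metrics explorer:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

# Placeholder resource ID of the Azure OpenAI account.
RESOURCE_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.CognitiveServices/accounts/<openai-account>"
)

client = MetricsQueryClient(DefaultAzureCredential())

result = client.query_resource(
    RESOURCE_ID,
    metric_names=["ProcessedPromptTokens", "GeneratedTokens"],  # assumed metric names
    timespan=timedelta(hours=24),
    granularity=timedelta(hours=1),
    aggregations=[MetricAggregationType.TOTAL],
)

# Print hourly totals so spikes in token usage are easy to spot.
for metric in result.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.total)
```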

It's worth noting that response latency for a completion request can fluctuate based on factors like the model type, the number of prompt tokens, the number of generated tokens, and overall system usage. A well-structured monitoring setup feeds into regular reviews, ensuring sustained performance over time.

Regular Reviews and Updates

Regular reviews are essential to keep your monitoring and performance tracking aligned with your business needs. Monthly assessments help identify trends, uncover optimization opportunities, and adapt to shifts in usage patterns. For example, if latency begins to rise, consider increasing resource allocation or tweaking batching strategies.

Analyzing historical token usage and prompt performance can reveal ways to optimize batch sizes and throughput. Use monitoring data to refine configurations - if certain models consistently deliver better results, redirecting more traffic to them can improve efficiency. Since PTUs typically scale in a near-linear fashion with call rates under stable workloads, this insight can guide your capacity planning.

Stay updated on Azure OpenAI developments by reviewing release notes and API documentation. New features or model updates could influence your performance strategies. Additionally, conducting quarterly reviews of your architecture - covering deployment strategies, content filtering policies, and workload separation - ensures your monitoring approach evolves alongside your business and technology needs. Keep a record of performance baselines and document adjustments to establish a strong foundation for ongoing improvements. Integrate these reviews into your regular operations for a consistent, data-driven approach to optimization.

Alternative AI Tools for Better Performance

While Azure OpenAI offers a strong foundation, pairing it with complementary tools can take AI performance to the next level. One standout option is NanoGPT, which provides a variety of features designed to enhance flexibility and efficiency.

Overview of NanoGPT


NanoGPT grants access to over 200 AI models, including popular names like ChatGPT, DeepSeek, Gemini, Flux Pro, Dall-E, and Stable Diffusion. This extensive model library is particularly helpful for testing different approaches before committing to a specific Azure OpenAI deployment strategy.

One of NanoGPT’s standout features is its pay-as-you-go pricing model, which eliminates the need for subscriptions. This makes it a great choice for businesses with fluctuating AI usage patterns. Dan Burkhart, CEO and Co-Founder of Recurly, highlights the value of this pricing approach:

"Pay-as-you-go is paying for products, apps, services, media, and more as they are consumed... Subscribers become more and more valuable the longer they are retained. So this idea of pay-as-you-go in a healthy construct is a nice, clean alignment between the value that is received by subscribers and the price that is being requested to pay for it along a continuum, along a timeline continuum. And that is a healthy construct for a long-standing relationship with customers."

This flexible payment structure is ideal for experimentation, allowing users to evaluate models without the burden of fixed monthly costs. A NanoGPT user, Craly, shared their experience:

"I use this a lot. Prefer it since I have access to all the best LLM and image generation models instead of only being able to afford subscribing to one service, like Chat-GPT."

NanoGPT also prioritizes user privacy by storing conversations locally and ensuring that user data isn’t used for model training. This feature is especially important when dealing with sensitive information during development and testing phases.

Additionally, the platform keeps pace with advancements by adding new models within 1–4 hours of their release. This allows users to quickly experiment with the latest capabilities. For seamless integration, NanoGPT offers both an API and a browser extension, making it easy to incorporate into existing workflows. These features make it a valuable tool for comparing models and refining prompts alongside Azure OpenAI.

George Coxon, who uses NanoGPT for educational purposes, praised its ability to provide access to leading AI models without requiring subscription fees.

NanoGPT’s flexibility makes it particularly useful for proof-of-concept projects, benchmarking AI models, and scenarios that require diverse capabilities without long-term commitments. It offers a practical way to identify the best-performing models for your specific needs before scaling up with Azure OpenAI’s enterprise-grade infrastructure.

By combining NanoGPT with Azure OpenAI, you can unlock new possibilities for improving performance and driving innovation in your AI applications.

Learn more about NanoGPT

Conclusion

Achieving top-tier API performance comes down to smart token management, well-thought-out infrastructure, and consistent monitoring. The key is balancing speed, scalability, and cost-efficiency.

It all starts with token optimization. By crafting concise prompts and setting appropriate max_tokens values, you can cut down on both response times and costs significantly. Small adjustments here can make a big difference.

Your infrastructure setup also plays a critical role. Align deployments with specific workload demands to avoid bottlenecks. For example, separating workloads or batching multiple requests can help minimize unnecessary overhead.

Regular audits and rate limiting are crucial for long-term performance. By reviewing service usage regularly, you can spot inefficiencies early and address them before they escalate into costly problems. Effective rate limiting ensures your system can handle traffic without overloading.

These strategies are backed by real-world data. According to Gartner, 92% of startups leveraging AI reported faster market entry compared to their competitors. Tools like Azure API Management and Azure Monitor offer the infrastructure needed for enterprise-scale deployments. For instance, a multinational corporation utilized regional API gateways in East US, West Europe, and Southeast Asia to minimize latency during regional outages.

Optimization is an ongoing process. Regularly request quota increases, track token usage, and explore support resources to discover new ways to manage costs effectively. Pairing technical optimizations with strategic tools - such as NanoGPT for testing and experimentation - sets the stage for reliable and efficient AI applications. Together, these practices form a strong foundation for sustained, high-performance AI systems.

FAQs

How can I improve performance and reduce costs when using the Azure OpenAI API?

To get the most out of the Azure OpenAI API while keeping costs in check, it's all about smart token management. Cutting down on token usage per request not only speeds up responses but also trims expenses. Here’s how you can make it happen:

  • Combine requests: Instead of making several separate calls, group multiple queries into one. This reduces overhead and streamlines processing.
  • Tweak model settings: Adjust parameters like temperature and max tokens to avoid generating unnecessary or overly verbose outputs.
  • Keep an eye on usage: Regularly monitor token consumption to spot inefficiencies and fine-tune your approach.

By following these steps, you’ll create a smoother, more economical experience with the Azure OpenAI API.

How can I reduce error rates and stay within rate limits when using the Azure OpenAI API?

To reduce error rates and stay within the rate limits of the Azure OpenAI API, try these strategies:

  • Use exponential backoff retries: When you hit rate limit errors, this method increases the wait time between retries step by step, helping to avoid overloading the system.
  • Keep an eye on API quotas and request rates: Azure's quota management tools can help you monitor usage and ensure you don’t exceed your allocated limits.
  • Maintain a steady request flow: Avoid sudden traffic spikes by gradually increasing workloads over time. This helps prevent triggering rate limits.

Following these steps can help keep your API usage smooth and reliable.

How can Azure API Management (APIM) improve the performance and reliability of Azure OpenAI API deployments?

Azure API Management (APIM) boosts the reliability and performance of Azure OpenAI API deployments by offering high availability through options like multi-region or multi-zone deployments. With a 99.99% SLA, it ensures your APIs stay accessible, even during unexpected downtimes.

To enhance performance, APIM provides features such as request load balancing, response caching, and security policies. These tools help manage traffic efficiently, reduce delays, and safeguard your APIs. Together, they enhance uptime, scalability, and responsiveness, delivering a smooth experience for users of Azure OpenAI services.