The ability to generate a realistic video of a person speaking any text might still sound like science fiction, but it’s reality with Azure AI’s text-to-speech (TTS) avatars. This technology creates photorealistic digital humans that can speak with natural voices in multiple languages. In this article, I’ll dive into what Azure’s AI TTS avatars are, their key capabilities and use cases, how you can customize them, and where they stand compared to other avatar solutions. Along the way, we’ll look at examples of how these avatars are used and discuss why Microsoft’s enterprise security and compliance focus matters, as well as the current trade-offs (like cost) that come with this cutting-edge tech.
I have written about these avatars before, but since they reached general availability in August 2024 and have gained new capabilities, now is a good time for an update. You can read my previous article here: Photorealistic talking avatars with Azure AI Speech.
What Are Azure AI Text-to-Speech Avatars?
Key Capabilities of Azure’s Photorealistic Avatars
Use Case Examples
Customization: Your Own Avatar
Responsible AI: Safeguards and Ethical Use
Azure’s Avatars vs. Other AI Avatars
Conclusion
What Are Azure AI Text-to-Speech Avatars?
Azure AI Speech’s text-to-speech avatars are like AI-generated virtual people. You provide text, and the service produces a video of a lifelike human avatar speaking that text in a chosen voice and language. Under the hood, Azure combines its Neural Text-to-Speech engine (which generates the speech audio) with a deep-learning vision model that syncs the avatar’s facial movements to the audio. The result is a 2D photorealistic talking avatar that looks and sounds much like a real person delivering your content. Small details still give away that it is an AI-generated avatar, and in my opinion that is a good thing, as the intent is not to use this for deepfakes.
These avatars can be used in two modes:
Batch mode (asynchronous): You input a script (text or SSML) and get back a video file of the avatar speaking. This is great for creating pre-recorded videos (e.g. training materials, announcements).
Real-time mode (streaming): The avatar speaks live in response to text input, suitable for interactive chatbots or live presentations. In real-time mode, the system renders the avatar on the fly with low latency.
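To make the batch mode more concrete, here is a minimal sketch of the JSON body you would send to the batch avatar synthesis REST API. The field names and values follow Microsoft’s batch synthesis reference at the time of writing, but treat them as assumptions and check the current Speech service docs (including the api-version) before relying on them.

```python
import json

def build_batch_request(script_text: str) -> dict:
    """Build the JSON body for a batch (asynchronous) avatar video job.

    Field names follow Microsoft's batch avatar synthesis REST API;
    verify them against the current documentation.
    """
    return {
        "inputKind": "PlainText",                     # or "SSML"
        "inputs": [{"content": script_text}],
        "synthesisConfig": {"voice": "en-US-AvaMultilingualNeural"},
        "avatarConfig": {
            "talkingAvatarCharacter": "lisa",         # a prebuilt avatar
            "talkingAvatarStyle": "casual-sitting",   # avatar style
            "videoFormat": "mp4",
            "videoCodec": "h264",
        },
    }

body = build_batch_request("Welcome to our onboarding video.")
print(json.dumps(body, indent=2))
```

You would PUT this body to your Speech resource’s batch syntheses endpoint with your subscription key in the `Ocp-Apim-Subscription-Key` header, then poll the returned job until it succeeds and a download URL for the video appears.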

Avatars can be powered by any of the natural-sounding voices in Azure’s text-to-speech library (there are hundreds), by a custom neural voice, or by a personal voice. Just think about that: you can have the same digital person speak Spanish, Japanese, Finnish, Arabic, or many other languages simply by switching the input text and voice. The voice and the visuals are synchronized for convincing lip sync and even basic facial expressions. Avatars can seamlessly switch languages mid-conversation, enabling truly multilingual presentations and videos.
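Switching languages really is just a matter of swapping the voice and text in standard SSML. A small sketch (the voice names are real Azure neural voices at the time of writing, but verify them in the voice gallery):

```python
def build_ssml(voice: str, lang: str, text: str) -> str:
    """Wrap text in minimal SSML for a given Azure neural voice."""
    return (
        f"<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        f"xml:lang='{lang}'>"
        f"<voice name='{voice}'>{text}</voice>"
        f"</speak>"
    )

# The same avatar can deliver each of these -- only the SSML changes.
ssml_en = build_ssml("en-US-JennyNeural", "en-US", "Welcome to the demo.")
ssml_fi = build_ssml("fi-FI-SelmaNeural", "fi-FI", "Tervetuloa esittelyyn.")
ssml_es = build_ssml("es-ES-ElviraNeural", "es-ES", "Bienvenidos a la demo.")
```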
Do you want to try these avatars? It is easy, as Microsoft offers a web-based Avatar Content Creation tool in Azure AI Foundry Playground to try this out with no code. You can type in text, choose an avatar and voice, and generate a video preview right from your browser. Developers can also integrate the Avatar API into applications using the Speech SDK or REST calls, making it possible to embed these talking avatars into websites, apps, or live chat systems.

These avatars have a strong sci-fi vibe, which makes it easy to get excited about their potential.
Key Capabilities of Azure’s Photorealistic Avatars

Photorealistic human appearance: The avatars look like real humans (not cartoons), with natural facial movements. Avatars are trained on real video footage of people, so they capture details in lip shape and expressions. This realism helps in engaging viewers, as the avatar can convey a friendly or professional demeanor much like a real presenter.
Natural voices and multi-language support: Each avatar can speak in any of the neural voices from Azure’s catalog, covering dozens of languages and regional accents. You can also use a custom neural voice or personal voice to make the avatar sound like you. The voice synthesis is very good; Microsoft has made clear advancements in TTS.
Pre-built avatars library: Out of the box, Azure provides a collection of pre-made avatar characters you can use immediately. Each comes with a default look and can perform a set of gestures. This gives you a quick way to pick an avatar style that fits your scenario’s tone – whether it’s a friendly tutorial or a corporate announcement.

Custom avatars for branding: For organizations that need a unique virtual spokesperson (for example, an avatar of a specific employee or a brand character), the service supports training custom avatars. This involves providing about 10 minutes of video of a person (with their permission) to create an AI model of their likeness. The custom avatar can then speak with that person’s voice if you also train a custom neural voice, effectively creating a digital twin of a person. This is a powerful feature for a “CEO avatar” or a company spokesperson: imagine your CEO’s avatar delivering a keynote in multiple languages, or a virtual teacher that looks like a real instructor your employees know. However, this capability is gated behind a strict approval process (to prevent misuse); read on to learn more about this.
Real-time interactivity: A futuristic, but already possible, use case is interactive chatbots with an avatar face. Azure’s avatars can work with real-time AI: for instance, a customer support bot using Azure OpenAI GPT-4 can output answers that the avatar speaks out loud on a website. The avatar’s lip-sync is generated on the fly, creating the illusion of a live video chat. This opens up more engaging user experiences than plain text or voice alone.
Gestures and expressions: To avoid a “talking head” that’s too static, Azure AI Avatars allows some avatars to perform simple gestures triggered via text tags. Using Speech Synthesis Markup Language (SSML), a creator can insert bookmark commands or specify the avatar’s pose (e.g. pointing, nodding) to make the performance more lively. For example, the prebuilt “Lisa”, “Harry” and “Meg” avatars have various gestures available. Gestures add personality and emphasis to key points in the script.
High-quality output: The videos are rendered in 1080p Full HD at 25 FPS by default. It is possible to request outputs with transparent backgrounds (useful for overlaying the avatar on custom backdrops or slides). In real-time streaming, the avatar is delivered as a video stream (H.264). The fidelity is generally sufficient for professional content – you could play these avatar videos on a large projector at an event and they would still look sharp.
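The gesture mechanism mentioned above can be sketched like this. The `gesture.<name>` bookmark convention comes from Microsoft’s avatar gesture documentation, but the specific gesture name here is an assumption; check which gestures your chosen avatar and style actually support.

```python
def with_gesture(text_before: str, gesture: str, text_after: str) -> str:
    """Insert an avatar gesture via an SSML bookmark tag.

    The 'gesture.<name>' bookmark convention follows Microsoft's avatar
    docs; the gesture name must be one your avatar/style supports.
    """
    return (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        "xml:lang='en-US'><voice name='en-US-JennyNeural'>"
        f"{text_before} <bookmark mark='gesture.{gesture}'/> {text_after}"
        "</voice></speak>"
    )

ssml = with_gesture(
    "Here are our three key results.",
    "numeric1-left-1",   # assumption: a gesture available for Lisa
    "First, revenue grew.",
)
print(ssml)
```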
Use Case Examples
What can you actually do with these photorealistic avatars? Microsoft and early adopters have highlighted a variety of use cases:

Training and how-to videos: Companies spend lots of time and money filming training content or internal presentations. With TTS avatars, a learning & development team can script a training video and generate the presenter on-demand. This is faster and easier to update than a live shoot. For example, if a procedure changes, you just update the script and regenerate the video with the same avatar. It’s no surprise that enterprise training videos were one of the first scenarios Microsoft mentioned.
Customer service bots with a face: Chatbots and virtual assistants become more engaging when users can see who they’re “talking” to. Azure avatars can serve as virtual customer service agents on websites or kiosks, answering questions with a friendly human face instead of just text bubbles. Bank SinoPac in Taiwan is enabling an avatar to handle customer interactions on their service kiosks, see this in Microsoft’s blog post: Text to Speech Avatar in Azure AI is now generally available.
Marketing and sales: Avatars open up new forms of interactive marketing. Microsoft gave an example of the Microsoft Store on JD.com in China using an AI avatar as a live shopping host. During online sales events, a lifelike avatar could present laptop products, answer viewer questions in real-time, and essentially act as the live streamer. This can drive higher customer engagement, since viewers see a “person” demonstrating features and responding, without Microsoft needing to deploy a human host 24/7. The same idea can apply to product demos, tourism (a virtual tour guide), or retail kiosks where an avatar can showcase products dynamically.
Accessibility and content localization: Another powerful use case is making content more accessible. Organizations can take written content – say a company newsletter, a product manual, or a training document – and turn it into an audio-visual clip with an avatar narrator. This is helpful for people who prefer video/audio learning or those who benefit from spoken content. Because the avatars support many languages, the same piece of content can be delivered by the same avatar in multiple languages without reshooting. This kind of localization made easy is a big pro for global companies.
Education and training bots: We could see avatars used as virtual teachers or coaches. Imagine an AI tutor that appears on-screen to teach a language lesson or answer student questions, with a friendly face that can show encouragement. Think about an “AI teacher” who can give an online lesson and then take questions in a conversational style. Because these avatars can be interactive, they could also serve as virtual role-play partners for training – e.g. an avatar acting as a customer in a sales training scenario, responding to what the learner says.
Before jumping to all-out avatars, it’s important to use avatars thoughtfully (nobody wants a fleet of deepfake corporate drones).
Customization: Your Own Avatar
Can we customize the avatar to look or sound like me?
This is possible with custom text-to-speech avatars, which are a limited-access feature. Your own avatar is a custom model trained on footage of the person you want to digitize. Training your own avatar requires about 15 minutes of video of the “avatar talent” as training input, along with that person’s explicit consent to be turned into an avatar. The result is a private avatar model that only your organization can use. If you also provide audio of that person to train a Custom Neural Voice or Personal Voice model, the avatar can use their exact voice, making it extremely realistic.
What’s new is that there is now a Custom Avatar portal where you can upload your videos for training and manage the process self-service.

In the portal you can find all the information and requirements for creating your own avatar. It is important to follow the video recording requirements, as poor-quality videos will result in a poor-quality avatar.

It’s important to note that custom avatars currently require an application and approval before you can train one. Microsoft restricts this because of the obvious ethical implications of cloning someone’s likeness: you have to apply for limited access and have a valid use case. Each deployed custom avatar lives behind a unique endpoint and incurs hosting fees while it’s running.

There are costs involved with avatars. Model training can take 40-96 hours and is billed per hour. Endpoint hosting and avatar synthesis are also priced separately.

For up-to-date pricing, check out the Azure AI Speech Service pricing chart. At the time of writing, prices are as shown in the image.

This means that training a single custom avatar can cost between $600 and $1,440 USD. Keeping the endpoint available costs over $430 USD a month for each model. The price alone tells you this is not meant for casual fun; it is for enterprises that require high-quality, secure avatars.
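A quick back-of-the-envelope check of those numbers. The hourly rates below are my assumptions derived from the figures in this article ($600-$1,440 for 40-96 training hours implies $15/hour; roughly $0.60/hour hosting is consistent with “over $430 a month”); they are not an official price list, so check the Azure pricing page.

```python
# Rates inferred from the figures in this article -- assumptions, not
# official Azure prices. Always check the current pricing page.
TRAINING_RATE_PER_HOUR = 15.0   # implied by $600-$1,440 for 40-96 hours
HOSTING_RATE_PER_HOUR = 0.60    # roughly consistent with "over $430/month"

def training_cost(hours: float) -> float:
    """Cost of custom avatar model training for the given compute hours."""
    return hours * TRAINING_RATE_PER_HOUR

def monthly_hosting_cost(hours_in_month: float = 730) -> float:
    """Cost of keeping one avatar endpoint deployed for a month."""
    return hours_in_month * HOSTING_RATE_PER_HOUR

print(f"Training (40 h):  ${training_cost(40):,.0f}")       # -> $600
print(f"Training (96 h):  ${training_cost(96):,.0f}")       # -> $1,440
print(f"Hosting / month:  ${monthly_hosting_cost():,.0f}")  # -> $438
```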
Responsible AI: Safeguards and Ethical Use
Any technology that creates “deepfake”-like content raises important questions. Microsoft has put a lot of emphasis on Responsible AI practices in the design of Azure TTS avatars. They are keenly aware of the potential for misuse (e.g. making someone say things they never said, or creating deceptive videos). Here are some of the safeguards and requirements that are in place.

Limited access for high-risk features: As mentioned, to create a custom avatar that looks like a real person, you must go through an application process. Part of that process requires you to submit proof of the person’s consent – a recorded statement where the person (the “avatar talent”) acknowledges their image and voice will be used. Only approved use cases in specific domains (such as education, accessibility, customer service) are allowed for custom avatars, and you must commit to using it only for that purpose when you create your own avatar.
Disclosure and transparency: Microsoft’s guidelines insist that if you deploy an avatar (especially a custom one that might be mistaken for a real human), you should disclose that it’s AI-generated to your audience. This could be a small caption on the video or an introduction stating that this is a “virtual assistant”. Microsoft has also adopted the C2PA (Coalition for Content Provenance and Authenticity) standard to embed information in the avatar videos indicating they were AI-generated.
Invisible watermarks: In addition to metadata, Azure’s system inserts an invisible digital watermark into the output video and audio. This watermark is not perceivable by viewers, but Microsoft and authorized parties can detect it with a special tool. It serves as a hidden signature that the content is synthetic. If someone were to misuse an avatar video, this watermark could help trace it or simply confirm that “yes, this came from Azure’s system.” It’s an interesting security measure to deter malicious deepfakes using the service.
Content safety filters: Azure integrates Azure AI Content Safety checks into the avatar generation pipeline. Essentially, the text that you feed into the avatar will first be analyzed for hate speech, violent or sexual content, self-harm references, etc. If the text is flagged as violating the policy, the avatar will refuse to speak it. This should prevent obvious abuses like making an avatar spout extremist propaganda or harassment.
Privacy and data handling: Since this service can involve personal likeness and voice data, Microsoft treats that data carefully. Training videos for custom avatars are kept and processed under strict controls. The Azure platform itself is built with enterprise-grade compliance (GDPR, ISO 27001, etc.), so companies can use avatars without data leaving the Azure environment. If you use a prebuilt avatar and standard voices, you’re mostly using Microsoft’s own provided assets (no personal data there). But if you use a custom avatar or voice, you should be mindful of the AI ethics around that, and Microsoft’s terms enforce that you only use it for approved scenarios and never to deceive people.
Overall, Microsoft’s approach is to unlock the benefits of this tech (time and cost savings in content creation, improved engagement, accessibility) while mitigating the risks of deepfake abuse. There is a lot of governance in place, such as audit trails, usage guidelines, and technical safeguards like watermarking. This makes Azure’s offering stand out in the market, as many other avatar-generation tools (often consumer-focused startups) might not have such security or robust guardrails.
Azure’s Avatars vs. Other AI Avatars
With the rise of synthetic media, Azure isn’t the only player in the talking avatar space. Azure’s TTS avatars shine for enterprises that prioritize security, want tight integration with Azure’s AI stack, and possibly need the realism of a custom-trained avatar with a custom voice. Competing avatar generators shine for quick, easy video creation with a lower learning curve and usually a lower cost. Azure provides more oversight and guarantees around responsible use, whereas others put more weight on the user to use the tool ethically. The choice may come down to whether you’re an enterprise with stringent compliance needs or a content creator who just wants a handy AI video tool.
Conclusion
Photorealistic AI avatars are an exciting development at the intersection of speech and vision AI. There are still some “uncanny valley” moments (especially if you scrutinize the mouth movements), but for everyday business content, they are good enough. And the ability to instantly switch languages or update the script makes them practical for global communication.
In the end, Azure AI avatars are a reminder of how fast the future is arriving. They also challenge us to blend creativity with responsibility. For content creators and developers, this is an opportunity to re-imagine how we produce videos and interact with users. For organizations, it raises new policy questions (do we need an “AI avatar ethics” guideline?). And for audiences, it will undoubtedly become a normal part of the media we consume. As someone passionate about the future of work and AI, I find Azure’s photorealistic avatars very interesting, and they are already here to use. The tech is here and maturing; now it’s up to us to help customers come up with valuable use cases.
Where and how would you use photorealistic avatars?

Check out Microsoft’s article Text to Speech Avatar in Azure AI is now generally available.
Did I use AI to help me write this one? Of course! Deep Research was very helpful in creating the first draft, which I then edited further.
Published by Vesa Nopanen

I work, blog and speak about Future Work: AI, Microsoft 365, Copilot, Loop, Azure, and other services & platforms in the cloud connecting digital and physical and people together. I have 30 years of experience in the IT business across multiple industries, domains, and roles.