What Is Speech-to-Text (STT)?

Speech-to-Text (STT) is technology that converts spoken audio into written text. It listens to a person talking, then transcribes what they said into words a computer can read and act on.

You've used Speech-to-Text whenever you tap the microphone on your phone keyboard and watch your words appear as you talk. The software captures the sound of your voice, breaks it into small pieces, and matches those pieces to words. The result is plain text. People also call it automatic speech recognition, or ASR, and the two terms mean roughly the same thing.

Here's a concrete example. A customer calls your shop and says "What time do you close on Saturday?" An STT system turns that sentence into the text "what time do you close on Saturday." Now a computer program can read it, look up your hours, and reply. Without that text step, the software would just have raw audio it can't search or understand.

STT is the first stage of a voice assistant or phone bot. A voice conversation usually moves through three stages: STT writes down what the caller said, an AI model figures out what they want and writes an answer, and then Text-to-Speech (TTS) reads that answer back out loud. So when someone talks to a voice agent on a website or over the phone, STT is the part doing the listening.
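To make the three stages concrete, here is a minimal Python sketch of the flow. Every function name in it (`speech_to_text`, `decide_reply`, `text_to_speech`, `handle_call`) is a hypothetical stand-in, not a real product's API, and the "AI model" is reduced to a simple keyword lookup so the example stays runnable without any services.

```python
# Hypothetical stand-ins for the three stages; none of these are a real API.

def speech_to_text(audio: bytes) -> str:
    """Stage 1 (stub): a real STT engine would decode the audio into words."""
    return "what time do you close on saturday"


def decide_reply(transcript: str, closing_hours: dict[str, str]) -> str:
    """Stage 2 (stub): a keyword lookup standing in for the AI model."""
    text = transcript.lower()
    for day, closing_time in closing_hours.items():
        if day in text:
            return f"We close at {closing_time} on {day.capitalize()}."
    return "Sorry, could you repeat that?"


def text_to_speech(text: str) -> bytes:
    """Stage 3 (stub): a real TTS engine would return synthesized audio."""
    return text.encode("utf-8")


def handle_call(audio: bytes, closing_hours: dict[str, str]) -> bytes:
    """Run one caller turn through STT, the reply step, and TTS in order."""
    transcript = speech_to_text(audio)
    reply = decide_reply(transcript, closing_hours)
    return text_to_speech(reply)
```

Calling `handle_call(b"", {"saturday": "6 pm"})` walks the shop example through all three stages. The point of the sketch is the ordering: nothing downstream can run until STT has produced text.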

Accuracy depends on a few things. Background noise, strong accents, mumbling, and industry jargon all make the job harder, so a transcript won't always be perfect. Most modern systems handle clear speech well and keep improving with more training data. Many also let you add a custom word list, which helps with product names or terms specific to your business.
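As a rough illustration of what a custom word list does, the Python sketch below corrects near-miss words in a finished transcript. This is a swapped-in technique for demonstration only: real STT services typically use the word list as a hint during recognition, not as a clean-up pass afterward. The function name `apply_word_list`, the product name "Zentrix", and the 0.75 similarity cutoff are all assumptions made for the example.

```python
import difflib


def apply_word_list(transcript: str, custom_words: list[str], cutoff: float = 0.75) -> str:
    """Replace words that closely resemble a custom term with that term.

    Illustrative post-processing only; the cutoff is an assumed value
    that would need tuning on real transcripts.
    """
    # Map lowercase spellings back to the preferred capitalization.
    lookup = {word.lower(): word for word in custom_words}
    candidates = list(lookup)

    fixed = []
    for word in transcript.split():
        # Find the single closest custom term, if any is similar enough.
        match = difflib.get_close_matches(word.lower(), candidates, n=1, cutoff=cutoff)
        fixed.append(lookup[match[0]] if match else word)
    return " ".join(fixed)
```

For example, `apply_word_list("do you sell the zentrics blender", ["Zentrix"])` swaps the misheard "zentrics" for the product name and leaves the ordinary words alone. A cutoff that is too low would start rewriting everyday words, which is one reason real systems handle this inside the recognizer.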

For a small-business website, the practical payoff is letting people speak instead of type. A visitor can ask a question out loud and a chat or voice agent answers, which is handy on a phone or for anyone who finds typing slow. STT runs quietly in the background to make that possible.

Frequently asked questions

What's the difference between Speech-to-Text and Text-to-Speech?

They're opposites. Speech-to-Text takes spoken audio and writes it out as text. Text-to-Speech takes written text and reads it aloud in a synthetic voice. A full voice agent uses both, one to understand the caller and one to reply.

Is Speech-to-Text the same as voice recognition?

Not quite. Speech-to-Text figures out the words a person said. Voice recognition, sometimes called speaker recognition, identifies who is speaking by the sound of their voice. People often mix up the terms, but they answer different questions.

How accurate is Speech-to-Text?

Good systems transcribe clear speech with high accuracy, often getting more than 90 percent of the words right. Accuracy drops with background noise, heavy accents, or uncommon jargon. Adding a custom word list with your product or business terms usually helps.
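A common way to put a number on transcript accuracy is word error rate (WER): the count of word substitutions, insertions, and deletions needed to turn the transcript into the correct sentence, divided by the number of words in the correct sentence. The Python sketch below computes it with a standard edit-distance table. It assumes simple lowercasing and whitespace splitting, and the function name `word_error_rate` is made up for this example. Roughly speaking, 90 percent of words correct corresponds to a WER near 0.10.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes lowercase comparison and whitespace tokenization; real
    evaluations usually normalize punctuation and numbers as well.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edits needed to turn the first j hypothesis words
    # into the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

If the caller said "what time do you close on Saturday" and the transcript reads "what time do you close on Sunday", one of seven words is wrong, so the WER is 1/7, or about 0.14. Note that WER can exceed 1.0 when the transcript contains many extra words.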
