Building with AssemblyAI's LeMUR
State of the LLM Space
The generative AI wave is in full swing, but building with such a new technology is not easy. There are several challenges:
Not all data is textual - A significant chunk of data lives in formats such as PDFs or audio, which are not easily ingestible by large language models.
Technical complexity - LLM-powered applications started simple. In December of 2022, tech Twitter was flooded with people experimenting with the GPT API and building simple personal tools that replicated some form of the ChatGPT experience. Since then, complexity has increased dramatically. We now have multiple data formats (PDF, JSON, audio/video, etc.), tooling to deal with context-window limitations like RAG, orchestration tools like LangChain, security services like Mithril, and much more. LLM-powered applications are not just a continuation of existing development; they are fostering a new technology stack.
Engineering bandwidth - The vast majority of organizations see LLMs as a major value center but do not have the talent or bandwidth to develop and deploy these applications. As the complexity noted above increases, this problem will only get worse.
AssemblyAI: Pioneering Speech-to-Text Transcription and Beyond
AssemblyAI made its name as one of the best providers of transcription models - models that ingest audio and output the spoken words as text. Since its first model, the company has only improved, and it currently provides what is arguably the best publicly available speech-to-text model: Conformer-2. Speech data is a major component of enterprise data - usually generated from employee communication and customer service - but it is one of the hardest to access.
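To make this concrete, here is a minimal sketch of a transcription call using AssemblyAI's Python SDK; the API key and audio URL are placeholders of my own, not real values.

```python
# Minimal transcription sketch with the AssemblyAI Python SDK.
# The API key and audio URL below are placeholders.
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_API_KEY"

transcriber = aai.Transcriber()
# transcribe() accepts a public URL or a local file path and blocks
# until the transcript is ready.
transcript = transcriber.transcribe("https://example.com/lecture-audio.mp3")

print(transcript.text)  # the spoken words, as plain text
```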
There are no easy automated reasoning tools for speech data, but with the rise of LLMs there is now a way to reason automatically over text. AssemblyAI has not missed this. Its models were already solving problem 1 above. Given that AssemblyAI is a hub for speech data and owns state-of-the-art speech models, the company saw the opportunity to solve problems 2 and 3 as well, delivering more value to its clients (and becoming stickier) through its new offering: LeMUR. Imagine a service that takes in your audio data, transcribes it, and supports LLM-powered features like summarization, question answering, and extracting action items and highlights - all through an API that removes any need for orchestration or security management. That is LeMUR.
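Continuing from the transcript above, here is a minimal sketch of what those features look like through the Python SDK; the context string and prompt are my own, and summarize() and task() are the SDK's LeMUR helpers as I understand them.

```python
# A minimal LeMUR sketch, continuing from the transcript above.
# summarize() sends the transcript to LeMUR's summary endpoint;
# the context and answer_format strings are illustrative.
summary = transcript.lemur.summarize(
    context="A university lecture on company valuation",
    answer_format="bullet points",
)
print(summary.response)

# Free-form instructions go through the task endpoint.
action_items = transcript.lemur.task(
    "List any action items or homework the lecturer assigns."
)
print(action_items.response)
```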
Putting LeMUR to the test
I do not like most lectures. They are usually too long and boring, and it is hard for me to maintain attention for more than an hour. Many people have built solutions that transcribe popular podcasts or lectures, vectorize them, and integrate them with an LLM to produce factual responses to user queries. But what if you could simply hand your lectures to LeMUR and let it do all the heavy lifting? To find out, I transcribed the first three lectures of Aswath Damodaran's Valuation class* and fed them to the LeMUR API. You can test it out yourself with a simple webpage I built here: Valuation Companion.
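For the curious, the core of the project boils down to a single call to LeMUR's question-answer endpoint. Below is a hedged sketch of that call; the API key and transcript IDs are placeholders, and the payload shape follows the LeMUR v3 REST docs as I read them.

```python
# Sketch of the project's core: question answering over the three
# lecture transcripts via LeMUR's REST endpoint. The API key and
# transcript IDs are placeholders.
import requests

headers = {"authorization": "YOUR_ASSEMBLYAI_API_KEY"}
payload = {
    # IDs returned when each lecture was transcribed
    "transcript_ids": ["lecture_1_id", "lecture_2_id", "lecture_3_id"],
    "questions": [
        {
            "question": "How does Damodaran define intrinsic value?",
            "answer_format": "a short paragraph",
        }
    ],
    "final_model": "basic",  # the cheaper model my project uses
}

response = requests.post(
    "https://api.assemblyai.com/lemur/v3/generate/question-answer",
    headers=headers,
    json=payload,
)
for qa in response.json()["response"]:
    print(qa["question"], "->", qa["answer"])
```

Some observations: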
LeMUR prices by total input and output tokens, so you pay for what you use and only what you use. This is a good thing.
Most LLM applications use a database to search for relevant information and pass only that to the LLM to save costs - LLMs are expensive to run, after all. LeMUR has no such system and pushes the entire transcript through the model to produce your output. For long transcripts such as lectures (as in my project), this can create an enormous amount of unnecessary token burn and some spectacular bills. I sincerely hope AssemblyAI fixes this by at least providing the option to use a retrieval solution (see the sketch after this list).
The model context window is very large. The context window is the number of tokens the model can “remember” at once - anything outside the window has no bearing on the output. For example, if a model has a 1,000-token window and you feed it 1,100 tokens, it will effectively discard the first 100 tokens. LeMUR has a breathtakingly long context window of 1 million tokens, which AssemblyAI says corresponds to roughly 100 hours of audio. That figure is plausible: at a typical speaking rate of about 150 words per minute, 100 hours of speech is roughly 900,000 words, which is on the order of a million tokens. This is a major plus for me.
The generative model needs improvement. My project uses the basic version of the model, which is cheaper but worse at picking out the right information and producing quality output. But during testing, I found that even the full version (the ‘default’ label in the docs) is less coherent and capable than models like GPT-3.5.
You can only use transcripts generated by AssemblyAI's APIs. I get why I cannot plug in my own LLM or retrieval solution - that would defeat the point of the service, and I would be better off building my own stack. But why force me to use transcripts generated by one particular service?
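To make the retrieval point above concrete, here is a rough sketch of the optional step I have in mind: embed transcript chunks once, then send only the top-scoring chunks to the LLM instead of the whole transcript. The chunking scheme and the sentence-transformers model are my own illustrative choices - nothing like this exists in LeMUR today.

```python
# Illustrative retrieval step: select the few transcript chunks most
# relevant to a query so only those (not hours of lecture) get sent
# to the LLM. Uses sentence-transformers purely as an example
# embedding model; this is not part of LeMUR.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def top_chunks(transcript_text: str, query: str, k: int = 3,
               chunk_words: int = 200) -> list[str]:
    # Split the transcript into fixed-size word chunks.
    words = transcript_text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    # Embed chunks and query, then rank chunks by cosine similarity.
    chunk_emb = model.encode(chunks, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, chunk_emb)[0]
    best = scores.topk(min(k, len(chunks))).indices.tolist()
    return [chunks[i] for i in best]

# Only these few hundred words - not the full lecture - would go into
# the LLM prompt, cutting input-token costs dramatically.
```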
Overall, I loved working with LeMUR and am a fan of AssemblyAI's services. I look forward to improvements to LeMUR and the rest of the product suite.