My Journal
This section will be my day-to-day, week-to-week log of anything I learn, try, and fail at.
8/3/2024, 10:24:39 AM - Week 3 of My LLM Journey
To be honest, I didn't work much on the application at work, other than going through older Teams messages with the interns to sink (store) older responses that we couldn't automatically extract using the API because of its limitations.
Unrelated to work, I also started looking at extracting the data for my personal site (currently stored in sanity.io) to see if I could build a "DaveBot" to help answer questions about my experience. I was able to get the data out pretty quickly using the API and a GROQ query, but now I'm thinking through what the best approach would be (this seems like a good application for a graph). I'll continue to document anything interesting I find.
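For reference, Sanity exposes GROQ through a plain HTTP query endpoint, so the extraction is mostly just URL-encoding the query. A minimal sketch of what I mean (the project ID, dataset, document type, and fields here are placeholders, not my real schema):

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical project/dataset; substitute your own values.
PROJECT_ID = "myproject"
DATASET = "production"

# GROQ: fetch every document of type "experience" with a few fields.
groq = '*[_type == "experience"]{title, company, startDate, summary}'

url = (
    f"https://{PROJECT_ID}.api.sanity.io/v2021-10-21"
    f"/data/query/{DATASET}?query={quote(groq)}"
)

# Public datasets need no auth; private ones take an Authorization
# Bearer token header alongside this request.
req = Request(url, headers={"Accept": "application/json"})
print(url)
```

Issuing the GET returns JSON with a `result` array of matching documents, which is what I'd then feed into whatever structure (graph or otherwise) comes next.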
7/31/2024, 4:44:22 AM - TIL: ADF Data Flow Expression Language is Pretty OK
I've used Azure Data Factory (ADF) Data Flows a decent amount, but generally didn't need to do a whole lot in the Data Flow expression language other than really simple things like a filter or adding a date column. To minimize our dependence on an Azure SQL database, I'm working to move logic out of SQL and into the Data Flow. I considered using Synapse, but there was already a good bit of logic in the Data Flow, so this was the quickest way to get the win we needed (and it's likely we'll have a more sweeping architecture change once Fabric is available for our use).
I was pleasantly surprised how easy it was to translate the logic from SQL to the Data Flow, including some things that are easier (lpad!). Would I prefer a consistent language across all the Microsoft tools, like F# or Python? Sure, but this one is far from the worst I have to work with 🙂.
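As an example of the kind of translation I mean (a sketch, not the actual logic I ported, and assuming lpad's usual string/length/pad argument order): SQL Server has no LPAD function, so zero-padding a number takes the classic RIGHT trick, while the Data Flow expression language has it built in.

```
-- T-SQL: zero-pad a store number to 5 characters
RIGHT('00000' + CAST(store_id AS VARCHAR(5)), 5)

-- ADF Data Flow expression: same thing, more direct
lpad(toString(storeId), 5, '0')
```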
7/27/2024, 2:44:50 AM - TIL: Teams Uses HTML to Format
While I was playing with the process for our bot to respond, I was bothered by how the markdown output from the model looked in the reply. I did some digging and testing and found that if you post a reply containing HTML, Teams will render the formatting.
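For example, replying to a channel message through the Microsoft Graph replies endpoint just means setting the body's contentType to html (the IDs and content below are placeholders; in our setup the reply actually goes out through Power Automate rather than a raw Graph call):

```python
import json

# Placeholder IDs; in practice these come from the triggering message.
team_id, channel_id, message_id = "TEAM_ID", "CHANNEL_ID", "MESSAGE_ID"

url = (
    "https://graph.microsoft.com/v1.0"
    f"/teams/{team_id}/channels/{channel_id}"
    f"/messages/{message_id}/replies"
)

# Teams renders this HTML instead of showing raw markdown characters.
payload = {
    "body": {
        "contentType": "html",
        "content": "<b>Answer:</b><br><ul><li>point one</li></ul>",
    }
}

body = json.dumps(payload)  # POST this with a Bearer token in practice
```

So the practical fix was converting the model's markdown to HTML before posting, rather than fighting Teams' markdown handling.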
7/26/2024, 11:45:23 PM - Week 2 of My First LLM Journey
I came into the week with the data I needed, additional unexpected capacity, and an abundance of enthusiasm. A couple of my colleagues had already done a good bit of leg-work on setting up chatbots (which is a comparable application to what I'm targeting), so I started by looking at their code to see if I could grok how the tools worked.
I was successful in creating and persisting a ChromaDB database, storing the key data to use as context in the metadata, and the information used to find related data as the embedding.
I then moved on to trying to use the DB to add context to the request. I tried for a few hours to get something working from the example I had, without success. The process seemed to be providing the embedded data as the context rather than the particular metadata I was interested in.
After I got tired of banging my head against the desk, I gave my colleague (the original author of the code) a call to discuss where I was. He pointed me to a couple of articles that were very helpful.
- https://python.langchain.com/v0.2/docs/concepts/
- https://python.langchain.com/v0.2/docs/tutorials/rag/
Although I haven't made it all the way through these resources, they helped me better understand what was going on, how LangChain works, and how to extract and provide the context I needed. I definitely still need to improve the prompt and what data gets used for context, but I was able to get reasonable results.
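The core of what I finally understood, stripped of the LangChain specifics, is a retrieve-then-stuff pattern: find the nearest stored entries, then put their metadata (not the embedded text itself) into the prompt. A library-free sketch of that flow with toy vectors (function names and data here are illustrative, not from the actual code):

```python
# Toy "vector store": (embedding, metadata) pairs. Real embeddings
# would come from an embedding model; these are hand-picked stand-ins.
store = [
    ([1.0, 0.0], {"answer": "Call the helpdesk at x1234."}),
    ([0.0, 1.0], {"answer": "Reports live in SharePoint."}),
]

def retrieve(query_vec, k=1):
    """Return metadata of the k nearest entries (squared L2 distance)."""
    scored = sorted(
        store,
        key=lambda e: sum((a - b) ** 2 for a, b in zip(e[0], query_vec)),
    )
    return [meta for _, meta in scored[:k]]

def build_prompt(question, query_vec):
    """Stuff retrieved metadata into the prompt as context."""
    context = "\n".join(m["answer"] for m in retrieve(query_vec))
    return (
        "Answer using only this context:\n"
        f"{context}\n\nQuestion: {question}"
    )

prompt = build_prompt("How do I fix my VPN?", [0.9, 0.1])
# `prompt` is what would then be sent to the chat model.
```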
From here, it was a race to a solution that could work in the context I wanted (trigger on a keyword from Teams, process a response based on the parent message, and reply to the parent message). I ended up with roughly the following for the query process:
- The user types the keyword, triggering a Power Automate workflow that kicks off an Azure Data Factory pipeline
- The Azure Data Factory pipeline kicks off a Synapse notebook, which loads the ChromaDB file from a storage account and sinks the response to a database
- Inserting a new item into the database triggers another Power Automate workflow that writes the message back to Teams
Duct-taped, ugly, and full of latency, but I was able to crank it out in a few hours without creating any new resources (which in most cases we don't have the access to do ourselves). It should be good enough for me and the team to play with and tweak the retriever, prompt, added context, etc., to see if the product is worth productionizing and multiplying.
One of the largest sources of latency is that the flow triggering on new entries only checks the database once every 5 minutes. I went this route because I knew there wouldn't be any authorization issues, since our Synapse is already writing to the database. In hindsight, I had to set up the storage account for the DB anyway, so that may have provided a faster trigger (though I'm not sure whether Power Automate can trigger on blob creation or not).
7/20/2024, 1:03:53 PM - Week 1 Of My First LLM Journey
I have been passively gathering information about how to apply large language models. I'd get a little bit of information here, some other information there, all compounding into a high-level knowledge of how they work.
There was an application for our team at work that I thought would be a great starting point. It served our team, so I had the business expertise I needed. Also, the benefit if it works is high, and the impact if it doesn't is low.
I came upon some unexpected capacity, so I decided I'd take a stab at this application. I was able to get the required data (MS Teams channel messages and replies) into a storage account and pulled into a local environment. Next step: figure out LangChain.