Why You Should [Not] Fine-Tune on Synthetic Data

Name: Why You Should [Not] Fine-Tune on Synthetic Data
Start: 2024-12-11T17:30:00+01:00
End: 2024-12-11T19:30:00+01:00
Location: Impact Hub Brno

Hosted By

Markéta A. and 2 others

Details

Speaker:
Roman Grebennikov

Description
Custom task-specific LLMs offer significant benefits in terms of privacy (they can be run locally), cost (no per-request API fees), and quality (optimized for your specific business problem). Building such a model with existing tools is straightforward, provided you have enough training data. In practice, however, you often don't.

In this talk, we'll share the story of how we built a synthetic training data generation tool for the open-source search engine Nixiesearch. We'll use the open ESCI dataset and explore how much we can improve search relevance with synthetic training data in a practical use case. Does this approach even work? Is it a viable low-cost alternative to proper fine-tuning on explicit labels? How much does the LLM prompt matter? We'll compare OpenAI, Llama 3, and a custom-made model, and discuss the challenges and pitfalls we encountered during the project.

This time we will not be streaming.

Program:
17:30 Welcome chat
18:00 Talk
18:50 Discussion
19:10 Networking (Impact Hub)

About MLMUs:
Machine Learning Meetups (MLMU) is an independent platform for people interested in Machine Learning, Information Retrieval, Natural Language Processing, Computer Vision, Pattern Recognition, Data Journalism, Artificial Intelligence, Agent Systems and related topics. MLMU is a regular community meeting that usually consists of a talk, a discussion and subsequent networking. Besides Prague, MLMU has also spread to Brno, Bratislava and Košice.
