What Makes Good Synthetic Pretraining Data with Joël Niklaus from Hugging Face

Today we are taking a deeper look at one of the secrets behind LLM pretraining, synthetic data pipelines, with Joël Niklaus, machine learning engineer at Hugging Face! Joël and his team ran 90 controlled experiments and burned over a trillion tokens to figure out what actually makes good pretraining data and, in Hugging Face fashion, released all of their findings and artifacts openly! We’ll check out the different structured formats they used, the multiple ablations they ran, and the counterintuitive outcomes they got from the results! It will be a fun one (and it will be recorded too, so no worries).