Turn my blog feed into a QA dataset to fine-tune an LLM
A step-by-step guide to transforming blogs into AI-ready datasets

This project converts blog feed content into a structured Question-Answer dataset using Llama 3.2 (via Ollama) for local processing. The generated dataset follows a conversational format and can be automatically pushed to Hugging Face.
The open-source code is available here.
I wanted to fine-tune an open-source LLM on content I have produced, to see how capable such models are and how close I could get to a locally running model that "outputs tokens" the same way I would.
According to Daniel Kahneman and his book Thinking, Fast and Slow, humans have two modes of thought:
System 1: Fast, instinctive, and emotional. My posts on X are an example of this.
Multiple libraries exist to scrape data from X. One that I used recently and liked (it doesn't require an X API key) is twitter-scraper-finetune from ElizaOS.
System 2: Slower, more deliberative, and more logical. My blog is an example of this: some posts take me several hours to write, and I need to sleep on the topic before publishing.
For this, however, I didn't find any good out-of-the-box library to convert my posts into a QA dataset for fine-tuning a model.
So this is what I ended up building.
Getting Started
To do this you will need:
Python 3.11
Poetry (for Python dependencies)
Ollama (to run Llama 3.2)
Hugging Face account (for dataset upload)
And your blog in a JSON feed like https://didierlopes.com/blog/feed.json.
1. Install dependencies
poetry install
poetry run python -m spacy download en_core_web_sm
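To confirm the spaCy model installed correctly, you can try loading it inside the Poetry environment. This is just a quick sanity check, not part of the project code:
import spacy

# Load the model the project uses for sentence counting;
# raises OSError if the download in the previous step failed
nlp = spacy.load("en_core_web_sm")
print(len(list(nlp("One sentence. Two sentences.").sents)))  # -> 2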
2. Install Ollama and pull Llama 3.2
Follow instructions to install Ollama: https://ollama.com/
Select a model to run locally using https://ollama.com/search.
In this case, we want to run llama3.2:latest (https://ollama.com/library/llama3.2).
ollama pull llama3.2:latest

Then, we can check that the model has been downloaded with:
ollama list

Finally, we can test that it works with:
ollama run llama3.2:latest
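You can also query the model from Python. A minimal sketch using the ollama Python client (an assumption for illustration; the notebook may call Ollama differently, and the prompt here is just an example):
import ollama

# Send a single chat message to the locally running model
response = ollama.chat(
    model="llama3.2:latest",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response["message"]["content"])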

3. Configure Hugging Face
Create a write-enabled token at Hugging Face
Create a .env file:
HF_TOKEN=your_token_here
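This token is what allows the notebook to push datasets to your account. A minimal sketch of loading it and authenticating, assuming python-dotenv and huggingface_hub are available:
import os

from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()  # reads HF_TOKEN from the .env file into the environment
login(token=os.environ["HF_TOKEN"])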
Usage
1. Update the blog feed URL in this notebook.
Below is the feed structure being used. It's the default generated by Docusaurus, the framework I use to auto-generate my blog's feed.
url = "https://didierlopes.com/blog/feed.json"
JSON Feed Structure
{
  "version": "https://jsonfeed.org/version/1",
  "title": "Didier Lopes Blog",
  "home_page_url": "https://didierlopes.com/blog",
  "description": "Didier Lopes Blog",
  "items": [
    {
      "id": "URL of the post",
      "content_html": "HTML content of the post",
      "url": "URL of the post",
      "title": "Title of the post",
      "summary": "Brief summary of the post",
      "date_modified": "ISO 8601 date format",
      "tags": ["array", "of", "tags"]
    }
    // ... more items
  ]
}
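A minimal sketch of fetching and iterating over this feed with requests (field names taken from the structure above):
import requests

url = "https://didierlopes.com/blog/feed.json"
feed = requests.get(url, timeout=30).json()

for item in feed["items"]:
    print(item["title"], item["url"], item["date_modified"])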
2. Set your Hugging Face dataset repository name:
dataset_repo = "didierlopes/my-blog-qa-dataset"
This is what the dataset will look like on Hugging Face: https://huggingface.co/datasets/didierlopes/my-blog-qa-dataset/viewer.
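Pushing the finished dataset to that repository is then a one-liner with the datasets library. A sketch with a single illustrative record (the real records come from the pipeline, in the format described under "Dataset Format" below):
from datasets import Dataset

dataset_repo = "didierlopes/my-blog-qa-dataset"

# One record per blog post; this record is purely illustrative
records = [
    {
        "title": "Example post",
        "conversation": [
            {"role": "user", "content": "What is this post about?"},
            {"role": "assistant", "content": "It is a placeholder example."},
        ],
        "context": "Cleaned markdown content of the post.",
        "url": "https://didierlopes.com/blog/example",
        "date": "2025-01-01",
    }
]

# Requires being logged in to Hugging Face (see the token setup above)
Dataset.from_list(records).push_to_hub(dataset_repo)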

3. Run the notebook cells sequentially.
The notebook contains detailed explanations throughout to guide you through the process step-by-step.
Dataset Format
The generated dataset includes:
title: Blog post title
conversation: Array of Q&A pairs in role-based format
context: Original cleaned blog content
url: Source blog post URL
date: Publication date
Note: This is the format of the conversation field:
conversation = [
    {
        "role": "user",
        "content": (
            "You mentioned that when ChatGPT launched, everyone rushed to build "
            "financial chatbots. What were some of the fundamental truths that "
            "those who built these chatbots missed?"
        ),
    },
    {
        "role": "assistant",
        "content": (
            "Those building financial chatbots missed two fundamental truths:\n"
            "1. AI models are useless without access to your data.\n"
            "2. Access to data isn't enough - AI needs to handle complete "
            "workflows, not just conversations.\n"
            "These limitations led to chatbots that can't access proprietary "
            "data, can't handle complex workflows, and restrict analysts to an "
            "unnatural chat interface."
        ),
    },
    # ... more Q&A pairs following the same pattern
]
Summary of how it works
Fetches blog content from JSON feed
Cleans HTML to markdown format
Analyzes sentence count to determine Q&A pair quantity
Generates contextual questions using Llama 3.2 running locally
Creates corresponding answers
Filters and removes duplicate Q&A pairs
Formats data for Hugging Face
Pushes to Hugging Face Hub
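A condensed sketch of those steps in one loop. The helper names, the one-pair-per-ten-sentences ratio, the prompts, and the HTML cleaning with BeautifulSoup are all assumptions for illustration, not the notebook's exact code:
import ollama
import requests
import spacy
from bs4 import BeautifulSoup

nlp = spacy.load("en_core_web_sm")

def ask(prompt: str) -> str:
    # One completion from the locally running Llama 3.2 (hypothetical helper)
    reply = ollama.chat(model="llama3.2:latest",
                        messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"].strip()

feed = requests.get("https://didierlopes.com/blog/feed.json", timeout=30).json()
records = []
for item in feed["items"]:
    # Steps 1-2: fetch the post and strip the HTML (the notebook converts to markdown)
    text = BeautifulSoup(item["content_html"], "html.parser").get_text(" ", strip=True)
    # Step 3: scale the number of Q&A pairs with post length (assumed ratio)
    n_pairs = max(1, len(list(nlp(text).sents)) // 10)
    conversation, seen = [], set()
    for _ in range(n_pairs):
        # Steps 4-5: generate a question, then an answer grounded in the post
        question = ask(f"Write one question a reader might ask about this post:\n\n{text}")
        if question in seen:
            continue  # Step 6: drop duplicate questions
        seen.add(question)
        answer = ask(f"Answer using only this post.\n\nPost:\n{text}\n\nQuestion: {question}")
        conversation += [{"role": "user", "content": question},
                         {"role": "assistant", "content": answer}]
    # Step 7: one record per post, matching the dataset format above
    records.append({"title": item["title"], "conversation": conversation,
                    "context": text, "url": item["url"], "date": item["date_modified"]})
# Step 8: push to the Hub (see the datasets example earlier)
# Dataset.from_list(records).push_to_hub(dataset_repo)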