Extractive Question Answering with AutoTrain
Extractive Question Answering (QA) enables AI models to find and extract precise answers from text passages. This guide shows you how to train custom QA models using AutoTrain, supporting popular architectures like BERT, RoBERTa, and DeBERTa.
What is Extractive Question Answering?
Extractive QA models learn to:
- Locate exact answer spans within longer text passages
- Understand questions and match them to relevant context
- Extract precise answers rather than generating them
- Handle both simple and complex queries about the text
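To make "locating an answer span" concrete, here is a minimal sketch of how a span is typically chosen once a model has scored each token as a possible answer start and end. The scores, the token list, and the helper name best_span are made up for illustration; they are not AutoTrain output or part of its API.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) token indices maximizing start score + end score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for i, start_score in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start_score + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


# Hypothetical per-token scores for a short passage.
tokens = ["It", "appeared", "to", "Saint", "Bernadette", "Soubirous", "in", "1858"]
start_scores = [0.1, 0.0, 0.2, 3.5, 0.4, 0.1, 0.0, 0.3]
end_scores = [0.0, 0.1, 0.0, 0.2, 0.5, 3.9, 0.1, 0.6]

start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start : end + 1])
print(answer)
```

The answer is always a slice of the original passage, which is what distinguishes extractive QA from generative approaches.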
Preparing your Data
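Your dataset needs these essential columns (the names may differ, as long as you map them correctly during column mapping):
- text: The passage containing potential answers (also called context)
- question: The query you want to answer
- answer: Answer span information, including the answer text and its start position in the passage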
Here is an example of how your dataset should look:
{"context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.", "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?", "answers": {"text": ["Saint Bernadette Soubirous"], "answer_start": [515]}}
{"context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.", "question": "What is in front of the Notre Dame Main Building?", "answers": {"text": ["a copper statue of Christ"], "answer_start": [188]}}
{"context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.", "question": "The Basilica of the Sacred heart at Notre Dame is beside to which structure?", "answers": {"text": ["the Main Building"], "answer_start": [279]}}
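Because each answer is identified by a character offset into the passage, it can be worth checking that every answer_start really points at the answer text before training. The following is a minimal sketch, assuming records shaped like the example above (keys context, question, answers); the helper name check_record and the toy records are hypothetical.

```python
import json


def check_record(record):
    """Return a list of problems found in one SQuAD-style record."""
    context = record["context"]
    answers = record["answers"]
    problems = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        found = context[start : start + len(text)]
        if found != text:
            problems.append(
                f"offset {start}: expected {text!r}, found {found!r}"
            )
    return problems


# Toy records (not from a real dataset); the offset is computed, not hand-counted.
context = "AutoTrain trains models. It supports BERT."
good = {
    "context": context,
    "question": "Which architecture is supported?",
    "answers": {"text": ["BERT"], "answer_start": [context.index("BERT")]},
}
bad = json.loads(json.dumps(good))  # deep copy via a JSON round trip
bad["answers"]["answer_start"][0] += 1  # deliberately misaligned

for name, record in [("good", good), ("bad", bad)]:
    print(name, check_record(record))
```

To check a JSONL file, the same function can be applied to json.loads(line) for each non-empty line.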
Note: The preferred format for question answering is JSONL. If you want to use CSV instead, the answer column should contain stringified JSON with the keys text and answer_start.
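As an illustration of the CSV variant, here is a minimal sketch that converts JSONL records into CSV rows whose answer column holds stringified JSON. The column names follow the example records above, and the function name jsonl_to_csv is hypothetical rather than part of AutoTrain.

```python
import csv
import io
import json


def jsonl_to_csv(jsonl_text):
    """Convert SQuAD-style JSONL text to CSV text with a stringified answers column."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["context", "question", "answers"])
    writer.writeheader()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        writer.writerow(
            {
                "context": record["context"],
                "question": record["question"],
                # The nested object becomes a JSON string inside one CSV cell.
                "answers": json.dumps(record["answers"]),
            }
        )
    return out.getvalue()


sample = json.dumps(
    {
        "context": 'The legend reads "Venite Ad Me Omnes".',
        "question": "What does the legend read?",
        "answers": {"text": ["Venite Ad Me Omnes"], "answer_start": [18]},
    }
)
csv_text = jsonl_to_csv(sample)
print(csv_text)
```

The csv module handles quoting of embedded commas and quotation marks, so the JSON string survives a round trip through csv.DictReader and json.loads.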
Example dataset from Hugging Face Hub: lhoestq/squad
P.S. You can use either the SQuAD or the SQuAD v2 data format, provided the column mappings are correct.
Training Options
Local Training
Train models on your own hardware with full control over the process.
To train an Extractive QA model locally, you need a config file:
task: extractive-qa
base_model: google-bert/bert-base-uncased
project_name: autotrain-bert-ex-qa1
log: tensorboard
backend: local

data:
  path: lhoestq/squad
  train_split: train
  valid_split: validation
  column_mapping:
    text_column: context
    question_column: question
    answer_column: answers

params:
  max_seq_length: 512
  max_doc_stride: 128
  epochs: 3
  batch_size: 4
  lr: 2e-5
  optimizer: adamw_torch
  scheduler: linear
  gradient_accumulation: 1
  mixed_precision: fp16

hub:
  username: ${HF_USERNAME}
  token: ${HF_TOKEN}
  push_to_hub: true
To train the model, run the following command:
$ autotrain --config config.yaml
Here, we are training a BERT model on the SQuAD dataset using the Extractive QA task. The model is trained for 3 epochs with a batch size of 4 and a learning rate of 2e-5. The training process is logged using TensorBoard. The model is trained locally and pushed to the Hugging Face Hub after training.
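The max_seq_length and max_doc_stride values in the config control how passages longer than the model's input limit are split into overlapping windows. The sketch below illustrates the idea under two stated assumptions: that the stride is the number of tokens shared by consecutive windows (as with the stride argument of Hugging Face tokenizers), and that the question and special tokens are ignored for simplicity. The function name context_windows is hypothetical.

```python
def context_windows(num_tokens, max_len, stride):
    """Return (start, end) token ranges covering a passage with overlapping windows.

    Assumes `stride` is the overlap between consecutive windows, so each
    window advances by max_len - stride tokens.
    """
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    step = max_len - stride
    windows = []
    start = 0
    while True:
        end = min(start + max_len, num_tokens)
        windows.append((start, end))
        if end == num_tokens:
            break
        start += step
    return windows


# Values from the config above, applied to a hypothetical 1000-token passage.
for window in context_windows(1000, max_len=512, stride=128):
    print(window)
```

The overlap exists so that an answer falling near a window boundary still appears whole in at least one window.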
Cloud Training on Hugging Face
Train models using Hugging Face’s cloud infrastructure for better scalability.
As always, pay special attention to column mapping.
Parameter Reference
class autotrain.trainers.extractive_question_answering.params.ExtractiveQuestionAnsweringParams
( data_path: str = None, model: str = 'bert-base-uncased', lr: float = 5e-05, epochs: int = 3, max_seq_length: int = 128, max_doc_stride: int = 128, batch_size: int = 8, warmup_ratio: float = 0.1, gradient_accumulation: int = 1, optimizer: str = 'adamw_torch', scheduler: str = 'linear', weight_decay: float = 0.0, max_grad_norm: float = 1.0, seed: int = 42, train_split: str = 'train', valid_split: typing.Optional[str] = None, text_column: str = 'context', question_column: str = 'question', answer_column: str = 'answers', logging_steps: int = -1, project_name: str = 'project-name', auto_find_batch_size: bool = False, mixed_precision: typing.Optional[str] = None, save_total_limit: int = 1, token: typing.Optional[str] = None, push_to_hub: bool = False, eval_strategy: str = 'epoch', username: typing.Optional[str] = None, log: str = 'none', early_stopping_patience: int = 5, early_stopping_threshold: float = 0.01 )
Parameters
- data_path (str) — Path to the dataset.
- model (str) — Name of the pre-trained model. Defaults to “bert-base-uncased”.
- lr (float) — Learning rate for the optimizer. Defaults to 5e-5.
- epochs (int) — Number of training epochs. Defaults to 3.
- max_seq_length (int) — Maximum sequence length for inputs. Defaults to 128.
- max_doc_stride (int) — Maximum document stride for splitting the context. Defaults to 128.
- batch_size (int) — Batch size for training. Defaults to 8.
- warmup_ratio (float) — Warmup proportion for the learning rate scheduler. Defaults to 0.1.
- gradient_accumulation (int) — Number of gradient accumulation steps. Defaults to 1.
- optimizer (str) — Optimizer type. Defaults to “adamw_torch”.
- scheduler (str) — Learning rate scheduler type. Defaults to “linear”.
- weight_decay (float) — Weight decay for the optimizer. Defaults to 0.0.
- max_grad_norm (float) — Maximum gradient norm for clipping. Defaults to 1.0.
- seed (int) — Random seed for reproducibility. Defaults to 42.
- train_split (str) — Name of the training data split. Defaults to “train”.
- valid_split (Optional[str]) — Name of the validation data split. Defaults to None.
- text_column (str) — Column name for the context/text. Defaults to “context”.
- question_column (str) — Column name for the questions. Defaults to “question”.
- answer_column (str) — Column name for the answers. Defaults to “answers”.
- logging_steps (int) — Number of steps between logging. Defaults to -1.
- project_name (str) — Name of the project, used for the output directory. Defaults to “project-name”.
- auto_find_batch_size (bool) — Whether to automatically find the optimal batch size. Defaults to False.
- mixed_precision (Optional[str]) — Mixed precision training mode (fp16, bf16, or None). Defaults to None.
- save_total_limit (int) — Maximum number of checkpoints to save. Defaults to 1.
- token (Optional[str]) — Authentication token for the Hugging Face Hub. Defaults to None.
- push_to_hub (bool) — Whether to push the model to the Hugging Face Hub. Defaults to False.
- eval_strategy (str) — Evaluation strategy during training. Defaults to “epoch”.
- username (Optional[str]) — Hugging Face username for authentication. Defaults to None.
- log (str) — Logging method for experiment tracking. Defaults to “none”.
- early_stopping_patience (int) — Number of epochs without improvement before early stopping. Defaults to 5.
- early_stopping_threshold (float) — Improvement threshold for early stopping. Defaults to 0.01.
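Several of these parameters interact: batch_size and gradient_accumulation together determine the effective batch size, and epochs and warmup_ratio determine roughly how many optimizer steps are spent warming up. The sketch below is back-of-the-envelope arithmetic for a single device, not AutoTrain's own accounting; the function name training_schedule and the dataset size are hypothetical, and the trainer's exact rounding may differ.

```python
import math


def training_schedule(num_examples, batch_size=8, gradient_accumulation=1,
                      epochs=3, warmup_ratio=0.1):
    """Estimate optimizer steps for a single-device run (approximation only)."""
    effective_batch = batch_size * gradient_accumulation
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Hypothetical dataset of 1000 training examples with batch_size 4.
schedule = training_schedule(1000, batch_size=4)
print(schedule)
```

Note that a long passage can yield more than one training feature once it is split by max_doc_stride, so real step counts are often somewhat higher than this estimate.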