You would only use the base model during training; it is not needed at inference time. This is a distillation technique.
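
To make the idea concrete, here is a minimal sketch of a distillation loss in plain Python. It assumes the base model plays the teacher role and that a separate student model is the one being trained; the function names, the logits, and the temperature value are all made up for illustration and not taken from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; subtract the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as is common in distillation setups.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits for one training example.
base_logits = [2.0, 0.5, -1.0]      # from the base model (teacher), training only
student_logits = [1.5, 0.7, -0.5]   # from the model being trained

loss = distillation_loss(base_logits, student_logits)
print(loss)
```

The base model's logits appear only inside the loss, so once training is done only the student is needed to make predictions.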