It feels unfair I have to pay (or lose some usage) for this.
Interested in other people’s thoughts.
edit: typo
- Define some threshold for bad output
- Detect when a piece of output meets that threshold (vs just maybe not being what the user expected, which in my case is just fine)
- Refund the user credits so they can generate again
Text output is relatively easy to evaluate against some baseline quality threshold during generation, but my final output isn't text... so it gets harder.
Each failed generation would be very disruptive to the user on its own (in the scope of the app's purpose), so I'm also considering offering them an extra discount on their next purchase (in addition to the credit refund).
Do I get users to report generations they consider bad and then review them somehow? Do I try to auto-detect bad output before it's delivered to the user? Probably a mix of all of the above, while mitigating the potential for abuse (people making dummy generations and then reporting them 'just to try it out', or gaming the system to get multiple free generations). Maybe I'd need a time window for reporting a junk generation, plus a max "use" count that flags whether the user actually benefited from the output before reporting it...
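To make that abuse-mitigation idea concrete, here's a rough sketch of what the eligibility check might look like. Everything here is hypothetical and only illustrates the time-window-plus-use-count idea; the names (REPORT_WINDOW, MAX_USES_BEFORE_REPORT, Generation fields) and values are assumptions, not a real design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical policy knobs -- values here are only illustrative.
REPORT_WINDOW = timedelta(hours=24)   # how long after delivery a report is accepted
MAX_USES_BEFORE_REPORT = 2            # heavier use suggests the output actually had value

@dataclass
class Generation:
    delivered_at: datetime
    use_count: int          # how many times the user actually used the output
    already_refunded: bool

def is_refund_eligible(gen: Generation, reported_at: datetime) -> bool:
    """Decide whether a reported generation qualifies for an automatic credit refund."""
    if gen.already_refunded:
        return False
    # Reports must arrive within the time window after delivery.
    if reported_at - gen.delivered_at > REPORT_WINDOW:
        return False
    # If the user kept using the output, don't auto-refund; route to manual review instead.
    if gen.use_count > MAX_USES_BEFORE_REPORT:
        return False
    return True
```

Anything that fails these checks could fall through to a manual review queue rather than an outright rejection, so honest users with unusual usage patterns aren't penalized.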
I guess this turned into a bit of a brain dump.
If the AI companies just paid all of that out of the goodness of their pocketbook I'd be fine with it, but in reality I think they'd just pass on the costs, the same way that basically every business passes on spoilage, theft, return rates, etc. So I think the value would be risk mitigation rather than cost savings (as in, you know that if you pay for $10 worth of tokens, you get $10 worth of good tokens, but the individual token price would need to account for all the tokens the company doesn't get paid for).
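As a rough illustrative example: if 5% of all tokens sold ended up refunded, the remaining 95% would have to cover 100% of the cost, so the per-token price would need to rise by roughly 1/0.95, about 5.3%, for the provider to break even.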
As others have stated too, how do you define what an incorrect output is?
Would you give money back to your employer when you make a mistake?
It's not interesting because it suggests humans are still in the loop for some slow-cycle improvements. That'd never get past any board. In fact, letting you select the model mode implies it's your responsibility, so that meal was scraped into your flowerpot years ago.
I'd say fat chance.
Did it make a mistake because it didn’t follow instructions properly or hallucinated some content?
Did it make a mistake because the prompt was unclear/open to interpretation or plain wrong?
Did it make a mistake because it lacked some context? Or because it had too much context and started getting confused?
Is it a mistake not to handle edge cases automatically when that was not requested?
I am not just trying to defend LLMs; in many cases they make obvious mistakes and just don’t follow my arguably clear instructions properly. But sometimes it is not so clear cut. Maybe I didn’t link a relevant file (you can argue it could have looked for it), maybe my prompt just wasn’t that clear, etc.
If you choose to use them, you go in knowing they need help to be accurate. You clearly know how to use the tools to reach the accuracy you desire, but asking for that usage to be free seems to be based on a false premise. There has never been an expectation of accuracy in the first place when it comes to LLM output.