Okay, I guess I need to revise my understanding.
The way I understand it, Nyquist tells us that in order to perfectly reconstruct a signal, we need a sample rate that’s more than twice the highest frequency in the signal. And we need to leave enough room between that Nyquist rate and our sample rate to build an artifact-free filter that will knock down higher, alias-causing frequencies that would otherwise show up at more than half of our sample rate. IIRC, @FractalAudio posted that ideally, that “magic number” is somewhere around 62 kHz. That would accommodate a 20 kHz bandwidth, which is the audible spectrum, whether there’s distortion or not.
As I understand it, further oversampling can be used for noise reduction via sample averaging, where uncorrelated signal components (noise) would not add as strongly as the desired correlated signal.
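A minimal sketch of that averaging idea, assuming a constant signal level plus independent Gaussian noise (the names `SIGNAL`, `NOISE_STD`, and `K` are made up for illustration, and a real signal would only be roughly constant across the averaged samples):

```python
import random
import statistics

random.seed(0)        # fixed seed so the run is repeatable

SIGNAL = 0.5          # assumed constant (correlated) signal level
NOISE_STD = 0.1       # assumed standard deviation of the uncorrelated noise
K = 16                # number of samples averaged into one output sample
TRIALS = 2000         # how many output samples to measure

def noisy_sample():
    """One reading: the signal plus independent Gaussian noise."""
    return SIGNAL + random.gauss(0.0, NOISE_STD)

# One sample per output value versus the mean of K samples per output value.
single = [noisy_sample() for _ in range(TRIALS)]
averaged = [statistics.fmean(noisy_sample() for _ in range(K))
            for _ in range(TRIALS)]

noise_single = statistics.pstdev(single)
noise_averaged = statistics.pstdev(averaged)

print(f"noise without averaging: {noise_single:.4f}")
print(f"noise with {K}x averaging: {noise_averaged:.4f}")
```

The signal adds up in full while the noise only grows with the square root of the sample count, so under these assumptions the averaged noise should come out around 1/sqrt(K) of the original.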
I am open to re-education.
I think the issue with sample rates and modelling distortion is less about the audio-quality view of aliasing that you're thinking of (where you need a rate high enough to record, do simple processing, then play back without aliasing) and more like the issue of significant figures in mathematical calculations. This is because modelling the clipping, and especially cascading clipping, involves calculating distortion on top of distortion on top of distortion, and so on for each stage.
If you do the whole calculation at the normal sample rate, then each level of distortion is "rendered" into the sample rate, then the next stage has to calculate against that last stage. But even if the aliasing isn't audible, it's there to cause effects in the next stage of distortion. So by oversampling, you make sure that each stage of clipping is at a much higher resolution so no details are lost until the whole thing is done, then it's brought back to the standard rate.
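Here is a rough sketch of a single clipping stage done both ways, to show how much "extra" content gets baked in at the normal rate. Everything specific in it is an assumption for illustration: a tanh clipper, a 5 kHz test tone, 8x oversampling, and a simple windowed-sinc filter. It is not how any particular modeler does it. The 9th harmonic of 5 kHz is 45 kHz, which folds down to 3 kHz at a 48 kHz sample rate, so the level at 3 kHz tells us how much aliasing the next stage would have to chew on.

```python
import math

FS = 48_000      # base sample rate in Hz
OS = 8           # oversampling factor (assumed for this sketch)
F0 = 5_000       # test tone in Hz
ALIAS = 3_000    # where the 45 kHz harmonic folds to at 48 kHz
N = 480          # 10 ms at FS: whole cycles of both 5 kHz and 3 kHz
GAIN = 10.0      # drive into the clipper (assumed)

def clip(x):
    """Stand-in clipping stage: a plain tanh soft clipper."""
    return math.tanh(GAIN * x)

def tone_level(signal, freq, fs):
    """Amplitude of one frequency component (single-bin DFT)."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq * i / fs) for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * i / fs) for i, s in enumerate(signal))
    return 2 * math.hypot(re, im) / n

def lowpass_taps(num_taps, cutoff, fs):
    """Hamming-windowed sinc lowpass, normalized to unity gain at DC."""
    mid = (num_taps - 1) / 2
    fc = cutoff / fs
    taps = []
    for k in range(num_taps):
        t = k - mid
        h = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * k / (num_taps - 1))
        taps.append(h * w)
    total = sum(taps)
    return [h / total for h in taps]

# Clip directly at the base rate: harmonics above 24 kHz fold back down.
naive = [clip(math.sin(2 * math.pi * F0 * i / FS)) for i in range(N)]

# Clip at 8x the rate, lowpass, then keep every 8th sample.
fs_hi = FS * OS
hi = [clip(math.sin(2 * math.pi * F0 * i / fs_hi)) for i in range(N * OS)]
taps = lowpass_taps(129, 20_000, fs_hi)
M = len(hi)
# The test signal repeats exactly over M samples, so wrap-around indexing
# gives the steady-state filter output without edge effects.
oversampled = [sum(h * hi[(i * OS - k) % M] for k, h in enumerate(taps))
               for i in range(N)]

naive_alias = tone_level(naive, ALIAS, FS)
oversampled_alias = tone_level(oversampled, ALIAS, FS)

print(f"3 kHz alias level, clipped at 48 kHz: {naive_alias:.5f}")
print(f"3 kHz alias level, clipped at 8x:     {oversampled_alias:.5f}")
```

In a cascade, whatever is sitting at 3 kHz after the first stage gets distorted again by the second stage along with the real signal, which is the "distortion on top of distortion" problem.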
Since all this audio processing is just math, it's pretty similar to if you round after every calculation or just once at the end:
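Rounding only once, at the end:
12.5 / 3 = 4.166666667
4.166666667 * 2.1 = 8.75
8.75 / 2 = 4.375
4.375 * 4 = 17.5
17.5 * 5 = 87.5
87.5 / 3 ~= 29.2 (29.1666667)

Rounding to one decimal place after every step:
12.5 / 3 ~= 4.2 (4.166666667)
4.2 * 2.1 ~= 8.8 (8.82)
8.8 / 2 = 4.4
4.4 * 4 = 17.6
17.6 * 5 = 88
88 / 3 ~= 29.3 (29.3333333)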
This is obviously extremely contrived, and if I tweaked the numbers the difference could be made larger or smaller. But I think that's the core concept: If you do all your gain stage clipping calculations at the normal rate, each step is "rounded" into the sample rate. If you oversample, you keep additional detail so these differences don't become as pronounced.
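If anyone wants to play with the numbers, here is a small sketch of the same rounding comparison. The helper name `run_chain` and the `STEPS` list are just made up for this example.

```python
def run_chain(start, steps, digits=None):
    """Apply each (operator, operand) step in order.

    If digits is given, round to that many decimal places after every
    step; otherwise keep full floating-point precision throughout.
    """
    value = start
    for op, operand in steps:
        value = value / operand if op == "/" else value * operand
        if digits is not None:
            value = round(value, digits)
    return value

# The same chain of operations as in the example above.
STEPS = [("/", 3), ("*", 2.1), ("/", 2), ("*", 4), ("*", 5), ("/", 3)]

exact = run_chain(12.5, STEPS)              # full precision all the way through
rounded = run_chain(12.5, STEPS, digits=1)  # rounded after every step

print(f"full precision throughout: {exact:.7f}")
print(f"rounded after every step:  {rounded}")
print(f"difference:                {abs(rounded - exact):.4f}")
```

Changing `digits` or the entries in `STEPS` shows how the gap grows or shrinks depending on the numbers and on how coarse the rounding is.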