I haven't found any evidence that the individual program counter claim holds up beyond one niche application: taking a mutex from a single vector lane, which is not a performance optimization at all. In fact, you are serializing execution in the worst way possible.
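To put a number on that serialization, here is a toy model in Python (not real GPU code). The function name `run_with_lane_mutex`, the lowest-lane-wins arbitration and the instruction counts are all my own assumptions for illustration, and the cost of spinning on the lock is ignored, so real hardware would do worse than this.

```python
def run_with_lane_mutex(num_lanes, section_len):
    """Toy model: every lane of one SIMD group takes the same mutex once.

    Returns (issue_slots, order), where issue_slots counts how many
    instructions the group has to issue in total and order lists the
    lanes in the sequence they held the lock.
    Assumption: the lowest-numbered waiting lane wins the lock. Real
    hardware arbitration is not specified here.
    """
    pending = set(range(num_lanes))
    order = []
    issue_slots = 0
    while pending:
        owner = min(pending)        # one lane wins the lock
        issue_slots += section_len  # critical section issued with only the owner active
        order.append(owner)
        pending.remove(owner)       # owner releases the lock and is done
    return issue_slots, order


slots, order = run_with_lane_mutex(num_lanes=32, section_len=5)
print(slots)  # 32 lanes * 5 instructions each, issued one lane at a time
```

Without the lock, the same 5 instructions would be issued once for all 32 lanes. With it, the group issues them 32 times, each time with a single active lane.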
From a hardware design perspective, it is completely impractical to implement independent instruction pointers, other than maybe as a performance counter. Each instruction pointer requires its own read port on the instruction memory, and adding 32, 64 or 128 read ports to an SRAM is prohibitively expensive. Even if you had those ports, divergence would still lead to some lanes finishing earlier than others.
What you're probably referring to is a scheduler trick that Nvidia has implemented, where a streaming processor thread (a warp, in Nvidia's terms) with divergence is split into two masked streaming processor threads without divergence. This doesn't fundamentally change anything about divergence being bad: you will still get worse performance than if you had figured out a way to avoid divergence. The read port limitations still apply.
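A rough way to see why the split doesn't rescue you is to count issue slots. This is a Python sketch, not a model of any specific Nvidia scheduler. The names `issue_slots` and `lane_utilization` are made up for this example, and it assumes one shared instruction stream per group and a plain if/else with fixed instruction counts on each side.

```python
def issue_slots(lane_takes_branch, then_len, else_len):
    """Instructions issued by one SIMD group for an if/else.

    lane_takes_branch: one bool per lane (True = takes the 'then' side).
    Assumption: the group shares a single instruction stream, so each
    side with at least one active lane is issued in full while the
    other lanes are masked off.
    """
    any_then = any(lane_takes_branch)
    any_else = not all(lane_takes_branch)
    return (then_len if any_then else 0) + (else_len if any_else else 0)


def lane_utilization(lane_takes_branch, then_len, else_len):
    """Fraction of lane-slots that do useful work (1.0 = no waste)."""
    slots = issue_slots(lane_takes_branch, then_len, else_len)
    if slots == 0:
        return 1.0  # nothing issued, nothing wasted
    useful = sum(then_len if t else else_len for t in lane_takes_branch)
    return useful / (slots * len(lane_takes_branch))


uniform = [True] * 32
one_stray_lane = [True] * 31 + [False]

print(issue_slots(uniform, 10, 10))              # only the 'then' side is issued
print(issue_slots(one_stray_lane, 10, 10))       # both sides are issued
print(lane_utilization(one_stray_lane, 10, 10))  # half of the lane-slots are masked off
```

Whether the two sides run as one group with a flipping mask or as two separately scheduled masked groups, the slot count in this model is the same: a single stray lane doubles the issued instructions.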