You have already found just about the perfect format with which to hear imaginary voices.
The recording system you are using makes low-quality recordings with plenty of noise, flaws and artefacts, some of which resemble speech.
If you made a recording in noisy surroundings, you would hardly notice those flaws, because the real sound would mask them. Instead, you have learned to record in very quiet surroundings and then turn the volume right up on replay. What you hear is therefore the greatly amplified coding artefacts distorting the small amount of real sound that was available to be recorded.
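If you want to see the effect in numbers, here is a rough sketch. It assumes a plain 8-bit uniform quantiser as a stand-in for your recorder's coding, which I don't know and which is probably more complicated. The function name and the test-tone levels are made up for illustration. It compares how much of a loud tone and a very quiet tone survives the coding step.

```python
import math

def snr_after_quantization(amplitude, bits=8, n=48000, freq=440.0, rate=48000):
    """Return the signal-to-noise ratio in dB of a sine tone after
    uniform quantisation (an assumed stand-in for the real codec)."""
    levels = 2 ** (bits - 1)  # quantiser steps per unit of full scale
    signal_power = 0.0
    noise_power = 0.0
    for i in range(n):
        x = amplitude * math.sin(2 * math.pi * freq * i / rate)
        q = round(x * levels) / levels  # snap the sample to the nearest step
        signal_power += x * x
        noise_power += (q - x) ** 2     # error introduced by the coding
    return 10 * math.log10(signal_power / noise_power)

# Hypothetical levels: a tone at half of full scale versus one 40 dB quieter.
loud = snr_after_quantization(0.5)
quiet = snr_after_quantization(0.005)

print(f"loud recording:  {loud:.1f} dB signal-to-noise")
print(f"quiet recording: {quiet:.1f} dB signal-to-noise")
```

Turning the volume up on replay multiplies the real sound and the coding error by the same factor, so the ratio for the quiet recording stays just as poor. The artefacts simply become loud enough to hear.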
Anything you do to improve your recordings will reduce those flaws and hence make the 'voices' disappear. I appreciate that what you want is a way to enhance the voices so that you can make out their words, but I'm afraid that can't be done: they aren't voices, and there aren't any words to enhance.
Sorry to be the bearer of disappointing news, but that's just the way it is.