Please...Any Audio Programmers Out There? Question Re Pitch Detection

jesussaddle

Power User
I need to detect pitch fundamentals from music and use each detected pitch's value and amplitude-envelope information, as accurately as possible. I was discouraged recently to see a couple of Melodyne Audio-to-MIDI videos that seem to show that this problem, for polyphonic music, is still very far from being solved. The harmonics should all be visible as plotted on an FFT graph, such as here:

https://www.seventhstring.com/xscribe/PianoRoll.png

But evidently, even in polyphonic material from a single recorded instrument with no reverb or delay, and with what could be presumed to be fairly obvious transients, getting an idea of where the notes begin still seems to be a challenge.



See 00:32 in the video.

Mr. Russell uses the phrase "easily and accurately", but then fails to actually play the result of the polyphonic detection; instead he plays the audio recording as if it's the result. A couple of viewers in the comment section point this out and say that Melodyne basically does not yet do an accurate job: missed notes, and even notes placed at significantly inaccurate times, the latter being the most problematic for my needs. My experience with such programs, including WIDI Pro, agrees with those comments: this problem isn't even mostly solved for a simple piano track (much less a full band or a piece of music with FX processing, which I don't even care about for this post).

I'm not knowledgeable by any means, but I have a theory on how to solve this, not in anything near real time, but at least using significant software and hardware resources. Is there anyone out there who has an inkling of the calculation methods involved, whom I could maybe bounce my idea off of?
 
What is the question? Pitch detection is a difficult problem. Polyphonic pitch detection is orders of magnitude more difficult. Real-time pitch detection is much more difficult than non-real-time. Polyphonic real-time pitch detection is nearly impossible.
 
@FractalAudio, is this essentially a mathematical issue, e.g. pitch ambiguity due to both single note overtones and multiple note inter-modulation (especially with a small sample window)? Or is it more about processing speed limitations? Or both, or other...?

It is curious that some (perhaps rare) humans can identify every note in a complex chord, sometimes with perfect pitch (though not within milliseconds).
 
I'm weak on the theory and the math. I don't need to perform real-time polyphonic pitch detection, but non-real-time polyphonic pitch detection, and near-real-time monophonic detection, would be extremely helpful for a project I'm working on. A couple of days ago the game programmer I am working with decided that polyphonic pitch detection was doable using the Unreal audio spectrum analyzer. So I spent a couple of days digging around for references to show him that it's a very difficult and as-yet-unsolved problem. I showed him that even Melodyne is unable to consistently do a clean job of Audio-to-MIDI conversion, even on a piano recording... (He's taking a much-deserved break from my comments now...)

Because my brother has been in the tech field working on parallel processor designs, and did a bit of analog-to-digital satellite phone work, a long time ago I thought of approaching the problem by first isolating the exact point at which each repetitive partial begins: run the sampling as a whole series of out-of-phase sampling processes, then compare them to find the earliest point at which each repetitive partial is detected. I realize that for lower-frequency material, even missing one cycle can add milliseconds of error.

My question has to do not so much with errors of pitch as with errors of timing, especially at lower frequencies. I imagine that some errors involve either missing the beginning of a set of related partials, or missing the contiguousness of its pitch variation (dropping out and picking up again while the same pitch event, albeit modulating in pitch, remains audible to a listener). Possibly other errors come from missing this contiguousness due to the complexity of the varying timbre and pitch envelopes of polyphonic audio material.

I literally have no comprehension of the processes or the math. But it seems to me that we could partly eliminate the timing errors by FFT/sampling with a whole series of out-of-phase threads, then comparing them, first to get the most accurate note onsets we can (and percussive onsets too). Then I could allow settings that suppress timing errors based on the resulting precisely detected pulse map, i.e. suppress events that deviate too much from the tempo, as a preference. And I could prevent the selection of fundamentals that sound too Mike Sternish unless the Sternal setting is engaged. (Kidding, of course.)
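For what it's worth, the "out-of-phase threads" idea can be sketched with plain energy detection on a synthetic signal. Everything below (the 220 Hz test tone, the threshold, the thread count) is made up for illustration; this is not a real onset detector, just a demonstration of how offset frame grids tighten onset timing:

```python
import numpy as np

fs = 44100                     # assumed sample rate, Hz
n_frame = 2048                 # analysis frame length, samples (~46 ms)
n_offsets = 8                  # number of "out-of-phase" analysis threads
step = n_frame // n_offsets    # 256-sample offset between thread grids (~6 ms)

# Synthetic test signal: silence, then a 220 Hz tone starting at a known sample.
onset_true = 30000
x = np.zeros(fs)
t = np.arange(fs - onset_true) / fs
x[onset_true:] = np.sin(2 * np.pi * 220 * t)

def first_energetic_frame(sig, start, frame, thresh=1.0):
    """Start sample of the first frame (stepping from `start`) whose energy exceeds thresh."""
    for pos in range(start, len(sig) - frame, frame):
        if np.sum(sig[pos:pos + frame] ** 2) > thresh:
            return pos
    return None

# Each thread's frame grid is shifted by a different offset. The latest
# frame-start that still catches the note tightens the lower bound on the
# onset from one full frame down to one step.
detections = [first_energetic_frame(x, k * step, n_frame) for k in range(n_offsets)]
estimate = max(d for d in detections if d is not None)
print(f"true onset {onset_true}, estimated onset >= {estimate} "
      f"(error {onset_true - estimate} samples)")
```

With these numbers a single frame grid only localizes the onset to within ~46 ms, while the eight offset grids together narrow it to within ~6 ms. Note this only sharpens timing; it does nothing for the frequency ambiguity discussed later in the thread.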

I would try and trace the fundamentals within range of the harmonic instruments, and leave bells to someone else.

But my question is if I have unlimited time to process, and unlimited processing power, is this way of thinking reasonable to someone who understands the processes?
 
If you "literally have no comprehension of the processes or math" you're destined for failure.

Pitch detection is a very advanced topic. If you use an FFT then you will inherently destroy the "timing" because an FFT has no time information. In signal processing we typically use the STFT (short-time Fourier transform), which operates on a frame of data. Your time "ambiguity" is the length of the frame. So just use a very short frame, you say. Well, then you increase the frequency ambiguity, because the frequency resolution improves in proportion to the frame length (the bin spacing of an N-point FFT is the sample rate divided by N).
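That tradeoff can be illustrated numerically. A small sketch, assuming a 44.1 kHz sample rate and two low piano notes a semitone apart (the note choice and frame sizes are arbitrary):

```python
import numpy as np

fs = 44100                 # assumed sample rate, Hz
f1, f2 = 55.00, 58.27      # A1 and Bb1, a semitone apart (3.27 Hz)
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * f1 * t) + np.sin(2 * np.pi * f2 * t)

for n_frame in (1024, 4096, 16384):
    df = fs / n_frame              # FFT bin spacing: fs / N
    dt_ms = 1000 * n_frame / fs    # time ambiguity of one frame
    resolvable = df < (f2 - f1)    # can the two notes land in separate bins?
    print(f"N={n_frame:5d}  bin={df:5.2f} Hz  frame={dt_ms:6.1f} ms  "
          f"notes separable: {resolvable}")
```

To put these two bass notes into separate bins at all, the frame has to grow to 16384 samples, roughly 372 ms, which is exactly why note timing in the low register suffers.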

It's my belief that the ultimate technique would be to use wavelets with the basis set based on something that resembles a short plucked guitar string. However even the FWT is extremely CPU intensive.
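To make the FWT idea concrete: below is a toy Haar fast wavelet transform in NumPy. It is emphatically not the plucked-string basis described above (designing such a basis is the hard part); it just shows the halving structure of the FWT:

```python
import numpy as np

def haar_fwt(x):
    """Full Haar fast wavelet transform (input length must be a power of two).

    Returns (final approximation, [detail coefficients per level]).
    Each level halves the data, so the transform itself is cheap;
    the cost Cliff mentions comes from richer, instrument-like bases.
    """
    x = np.asarray(x, dtype=float)
    details = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2)    # low-pass: coarse approximation
        diff = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-pass: detail at this scale
        details.append(diff)
        x = avg
    return x, details

sig = np.sin(2 * np.pi * np.arange(8) / 8)  # one cycle of a sine, 8 samples
approx, details = haar_fwt(sig)
```

The orthonormal scaling means the coefficients preserve the signal's energy across levels, which is what lets the multi-scale detail coefficients localize events in both time and scale.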

Things like Melodyne create markers that determine the start and end of the sound. Then they use advanced (very advanced) signal processing techniques to try to figure out the frequencies in that window.
 
What is the question? Pitch detection is a difficult problem. Polyphonic pitch detection is orders of magnitude more difficult. Real-time pitch detection is much more difficult than non-real-time. Polyphonic real-time pitch detection is nearly impossible.
I remember reading an article a few years ago where a university researcher was applying cutting-edge techniques to determine exactly which notes were being played at the start of some Beatles song where the bass, guitar, and piano were all playing at the same time. Wish I could remember the details.

The result? No one could really agree if he was right or not. It really seems like it should be an easy problem, but oh my goodness it isn't.
 
Within the field of psychoacoustics and human auditory perception there are significant variations in frequency and temporal resolution, formant and harmonic detection, etc., even among normal-hearing listeners, and even greater variation among impaired listeners. A lot of research has been done, primarily with applications toward cochlear-implant processing. The big problem is the variability in listener perception: what counts as a perceived improvement, and what can't be detected by a majority. That turns the question from "can we do something" into "will it have a practical application most users would notice."

The same issue pertains to audio processing: even if it could be accomplished, let's say in a non-real-time manner, what practical applications would it have, and moreover, what monetary rewards justify the outlay of research into the field?
 
Jam Origin MIDI Guitar does a remarkably good job of doing polyphonic pitch detection. It's not flawless but if you can play very cleanly, it will do a very usable job of producing accurate MIDI data pretty quickly.
 
Within the field of psychoacoustics and human auditory perception there are significant variations in frequency and temporal resolution, formant and harmonic detection, etc., even among normal-hearing listeners, and even greater variation among impaired listeners. A lot of research has been done, primarily with applications toward cochlear-implant processing. The big problem is the variability in listener perception: what counts as a perceived improvement, and what can't be detected by a majority. That turns the question from "can we do something" into "will it have a practical application most users would notice."

The same issue pertains to audio processing: even if it could be accomplished, let's say in a non-real-time manner, what practical applications would it have, and moreover, what monetary rewards justify the outlay of research into the field?
The monetary rewards are there with my technology, because there are mathematical correlations between how we perceive music (see the video "missing fundamental" on YouTube) and how we perceive other phenomena. The correlation apparently exists, although reproducing it at a commercial level is a challenge. I've been researching this and talking to experts for quite a while, in academia and industry. Generally the response has been "Yes, that's something worth researching." It's hard to find someone with the general grasp to understand the cost/benefit equation. Non-German Western music theory has certain holes, called Helmholtz. :D (Sorry; my critique of Western music theory is that you can conceive of chords as stacked thirds, with the chord "root" simply being the starting point for stacking, an obviously useful definition. But the term "root" apparently meant "fundamental" in German, in other words not just the bottom note. [I think the language where it meant "fundamental" was German, but now I'm not sure; Wikipedia removed the paragraph describing this scenario, and my memory of it is not clear.] In terms of common use, the majority of chord structures that flow into popular harmony contain an interval of a perfect fifth, and in many cases the root of the P5 is the chord root.)

I know that science sometimes relies on verification by breaking things into units and trying to establish mechanisms. Sometimes a fundamental simplicity exists that gets missed; look at how simple the formula E = mc² is. But pitch perception (meaning notes as heard in chords) is a bit of a common ground. Music theory students do a lot of chord-recognition practice, and from what I gather, students quite broadly learn to recognize the qualities of chords. There is also a pretty good consensus now that we hear "notes" not necessarily because the fundamental is the loudest partial, but because we inherently sense the order of the harmonic series and hear it as a "note", even when the fundamental is missing. Although most pitched, non-percussive musical instruments have a louder fundamental, that is not the mechanism our hearing uses to establish the note: when we sing the vowel "Aaaaaaaa", the "note" we hear is not the loudest of the partials, but it is the mathematically correct fundamental of its overtone series.
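The missing-fundamental effect is easy to demonstrate numerically. The sketch below (a simple autocorrelation detector with made-up parameters, not a claim about how the ear actually works) synthesizes harmonics 2 through 6 of 110 Hz with no energy at 110 Hz itself, yet a periodicity-based detector still reports a pitch near 110 Hz:

```python
import numpy as np

fs = 44100
f0 = 110.0                       # the fundamental that is *absent* from the signal
t = np.arange(int(0.2 * fs)) / fs

# Sum harmonics 2..6 only: no energy at 110 Hz itself.
x = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(2, 7))

# Autocorrelation pitch detection: it finds the repetition period,
# not the loudest spectral peak.
ac = np.correlate(x, x, mode='full')[len(x) - 1:]
lag_min = int(fs / 500)          # search pitches from 500 Hz ...
lag_max = int(fs / 60)           # ... down to 60 Hz
lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
print(f"detected pitch ~ {fs / lag:.1f} Hz")   # near 110 Hz despite the missing fundamental
```

The waveform repeats every 1/110 s because the harmonics share that common period, which is exactly the "we hear the series, not the loudest partial" point above.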

Our hearing process, from which we are able to extrapolate "notes" and learn to understand their musical function, apparently relates to other perceptual processes. My method involves a correlation of this process with color, mainly for lighting technology. (It may have other applications as well, including makeup and clothing-selection apps, and more educational uses.)

When we look expecting to find something, we often don't find what's there.
 
Here's a clip I recorded when I first tried MIDI Guitar 2 on my iPad. I used a Line 6 Sonic Port VX. The patch was a built-in Rhodes piano, set for no pitch bend and poly tracking.

 
I used to do a fair bit of research in grad school in things like localization abilities, minimum gap detection etc of normal and impaired listeners. Also used to try to get volunteers from the conservatory of music to try to determine if there was any strong associations between auditory skills and musical ability; and more so, could those skills be learned. Didn’t really have time and funding for large scale work, but certainly saw far better frequency resolution etc in the musicians. Question remained though as to if they refined those skills through music, OR, if they have had a natural ability, which therefore allowed them to flourish as a musician, while other folks without such ability simply can’t seem to make headway learning an instrument and gave up?
 
Here's a clip I recorded when I first tried MIDI Guitar 2 on my iPad. I used a Line 6 Sonic Port VX. The patch was a built-in Rhodes piano, set for no pitch bend and poly tracking.



Thanks very much! It's useful output data from my perspective, but it needs manual massaging. The monophonic lines and runs are handled quite well, though not perfectly; and I'm sure it's a more difficult problem to do this accurately from a live guitar than from a file. Also, I need to keep in mind that a guitar strum is semi-monophonic; we would need to pluck the strings simultaneously, jazz-style, which maybe you do here and there in the example, but I'm not sure. I imagine that when some of the polyphonic material is strummed, it becomes much easier to detect, because each harmonic set starts at a distinct moment and the algorithm can line them up.

Thanks again!
 
If you "literally have no comprehension of the processes or math" you're destined for failure.

Pitch detection is a very advanced topic. If you use an FFT then you will inherently destroy the "timing" because an FFT has no time information. In signal processing we typically use the STFT (short-time Fourier transform), which operates on a frame of data. Your time "ambiguity" is the length of the frame. So just use a very short frame, you say. Well, then you increase the frequency ambiguity, because the frequency resolution improves in proportion to the frame length (the bin spacing of an N-point FFT is the sample rate divided by N).

It's my belief that the ultimate technique would be to use wavelets with the basis set based on something that resembles a short plucked guitar string. However even the FWT is extremely CPU intensive.

Things like Melodyne create markers that determine the start and end of the sound. Then they use advanced (very advanced) signal processing techniques to try to figure out the frequencies in that window.

I loved the opening line of your response.

Really, really helpful stuff Cliff. Thank you. I'm mainly in need of creating an overall proof of concept. Polyphonic pitch detection results can be edited prior to use, and in the meantime I can use MIDI-sequenced bits and MIDI methods like those used originally in Band in a Box and Jammer 4. I'll forward your answer to my brother who has the background. At some point it would be an honor to get your reaction to the patent material.
 
I'm wondering if this is the sort of application that machine learning could be used for. A quick google showed a recent paper on using ML for pitch detection on human voices.
 
I'm wondering if this is the sort of application that machine learning could be used for. A quick google showed a recent paper on using ML for pitch detection on human voices.

Not sure how ML would be particularly useful for the pitch detection itself. One of the big reasons to do voice pitch detection is to better derive intent/mood and such for speech recognition. That’s where ML and lots of crowd sourced data could help considerably. And in this scenario there’s no need to be precise. Most of the things to consider are pitch deltas more than absolute pitch values.
 