Modern computational linguistics can get around the encryption on VoIP calls well enough to reconstruct what is being said. Even though they are encrypted, the frames that make up a Skype call contain clues about which phonemes are being spoken.  
Cracking a code is usually a matter of complex mathematics and not much else. However, if the encrypted data retain any statistical relationship to the original data, there can be shortcuts to recovering it that make the whole scheme much less secure. This is well known, and yet you might be surprised to discover that Skype, and many other VoIP telephone systems, are vulnerable to this sort of attack.  
The reason is that the best form of compression for voice data makes use of the structure of speech - the Linear Predictive Filter. The basic idea is that the data is compressed by using an input code word that represents the sound made in the throat by the vocal cords. A set of filter parameters is then chosen to represent the shape of the mouth and resonant cavities. The parameters are adjusted so that the filter's output matches the original sound as closely as possible - this is an example of analysis by synthesis, i.e. you analyze a signal by setting up a system that recreates it accurately.
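
To make the idea concrete, here is a minimal sketch of linear prediction in Python using NumPy. It is an illustration under stated assumptions, not the codec Skype uses: the "vocal cord" excitation is assumed to be a simple impulse train, the "mouth" is assumed to be a two-coefficient all-pole filter, and the function names (`lpc_coefficients`, `synthesize`, `residual`) are hypothetical. The filter parameters are estimated with the textbook autocorrelation method, and the excitation is then recovered by running the filter in reverse.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate predictor coefficients a[1..order] (autocorrelation method)."""
    n = len(frame)
    # Autocorrelation of the frame at lags 0..order
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    # Normal equations R a = r[1:], where R is the Toeplitz autocorrelation matrix
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def synthesize(excitation, a):
    """All-pole filter ("mouth"): s[n] = e[n] + sum_k a[k] * s[n-k]."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * out[n - k]
        out[n] = acc
    return out

def residual(signal, a):
    """Inverse filter: e[n] = s[n] - sum_k a[k] * s[n-k]."""
    e = signal.copy()
    for k, ak in enumerate(a, start=1):
        e[k:] -= ak * signal[:-k]
    return e

# Assumed toy "speech": an impulse train (excitation) through a known filter.
excitation = np.zeros(800)
excitation[::80] = 1.0               # one pulse every 80 samples
true_a = np.array([1.2, -0.6])       # illustrative, stable filter parameters
signal = synthesize(excitation, true_a)

# Analysis: estimate the filter, then recover the excitation it implies.
a_est = lpc_coefficients(signal, order=2)
recovered = residual(signal, a_est)

print("estimated filter parameters:", a_est)
print("largest excitation error:", np.max(np.abs(recovered - excitation)))
```

A real codec would send a compact code for the excitation plus the filter parameters for each frame, rather than the raw samples, which is why each frame ends up reflecting the sound being spoken.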