Listener Gets a VAD

So, the 4th semester has begun, in the midst of losers and overachievers, and it promises to set my a$$ on fire. As usual, I plan to continue working under Dr. Kihara this sem, so that should be interesting. Anyway, I decided to improve upon what Listener offered and add a VAD (voice activity detection) algorithm to it. I initially chose the algorithm by Moattar and Homayounpour, but realized it left me with too much to do (it may well be a good candidate for later, when I have more time). So I snooped around for something simpler and found this paper, which seemed small and uses a threshold that adapts from one frame to the next. The paper was authored by S. Milanovic, Z. Lukac and A. Domazetovic. I still don't think I got it exactly right, though. The paper mentions using counters to mark frames as silence based on what the previous frame was, and I had to come up with a counter upper bound myself; I settled on 10. That is, even if a particular frame doesn't make it past its threshold, it will still be marked as active if the previous frame was active, for up to 10 frames in a row. This accommodates situations where we trail off in volume at the end of a word or sentence.
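To make that concrete, here is a stripped-down sketch of the per-frame rule as I ended up implementing it in the script below (the helper name and its arguments are just for illustration): the threshold drifts slowly toward the previous frame's peak sample, and the hangover counter carries activity over up to 10 sub-threshold frames.

def classify_frames(frames, init_thresh=1000, hangover=10):
    '''frames: list of 30-sample frames of absolute int16 values.'''
    decision = [0]  #sentinel so decision[-1] is always defined
    thresh, max_samp = init_thresh, 0
    inactive_counter = 0
    for frame in frames:
        #threshold decays slowly toward the previous frame's peak sample
        new_thresh = thresh * (1 - 2.0 ** -7) + (2 ** -8) * max_samp
        active = sum(1 for s in frame if s > new_thresh) / 30.0 >= 0.9
        if active:
            decision.append(1)
        elif inactive_counter < hangover and decision[-1] == 1:
            decision.append(1)  #hangover: keep riding the previous active frame
            inactive_counter += 1
        else:
            decision.append(0)
            inactive_counter = 0
        thresh, max_samp = new_thresh, max(frame)
    return decision[1:]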

Finally, to decide whether there was speech at the level of the whole recording, I look for at least 3 instances of 18 consecutive frames being marked as active. These are fairly arbitrary picks: 18 frames leaves room for 8 genuinely active frames plus the 10 extra the hangover counter can contribute, and 3 instances looked about right when I tested it by saying my own name.
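Viewed in isolation, that whole-recording check essentially just counts runs of 18 consecutive active frames in the per-frame decisions; something like this (the function name is mine, not from the script):

def count_speech_bursts(decision, run_length=18):
    '''Count how many times `run_length` consecutive frames are marked active.'''
    bursts = 0
    run = 0
    for val in decision:
        if val == 1:
            run += 1
            if run == run_length:
                bursts += 1
                run = 0  #start counting the next burst from scratch
        else:
            run = 0
    return bursts

#the recording is treated as containing speech when this returns >= 3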

And as a final measure, I also require the overall intensity to beat 48 dB, so that only someone actually trying to have a conversation with me gets recognized.
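The "intensity" here is just the RMS of the raw int16 samples expressed in dB relative to one sample unit (not calibrated SPL), computed the same way the script does further down:

import numpy

def rms_db(samples):
    '''samples: int16 sample values; returns 20*log10 of their RMS.'''
    squares = [float(s) ** 2 for s in samples]
    return 20 * numpy.log10(numpy.sqrt(sum(squares) / len(squares)))

#speech is only reported when rms_db(samples) > 48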

Finally, I made the switch from GeekTool to Growl for notifications. GeekTool kept taking up a solid amount of screen real estate, and since I run a 23'' monitor alongside my laptop's 15'' display, the widget ended up sitting outside my laptop's screen entirely. Growl seems like a better candidate overall, and since I finally managed to get the Growl Python bindings to build on my machine, I'm letting Growl handle this.
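For reference, the Growl side is tiny; this is essentially all the script below does with the Python bindings (the notification names are the ones it registers):

import Growl

#register the application and the notification types it may post
notifier = Growl.GrowlNotifier('Listener', ['Attention', 'test'])
notifier.register()

#post a notification: (notification name, title, description)
notifier.notify('Attention', 'Listener', 'Speech Detected Nearby')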

So, the only place where my VAD implementation (or my mod of whatever was in that paper) doesn't seem to work is in surroundings with a piano (our dorm's lobby, for example). Very inconvenient, but whatever; some time in the future I will hopefully understand DSP and spectral analysis well enough to come up with a simple VAD algorithm of my own (as opposed to implementing something straight from a paper without really understanding what is going on). Anyway, here is the updated script; it does a decent job of recognizing speech in reasonably quiet settings:

#!/usr/bin/env python
#Author: Shriphani Palakodety
#Tool to aid those with noise cancellation headphones

import pyaudio
import wave
import sys
import struct
import numpy
import time

Growl_exists = True

try:
    import Growl
except ImportError:
    print "No Growl"
    Growl_exists = False

skype_on_call = False
notifier = 0
if Growl_exists:
    notifier = Growl.GrowlNotifier('Listener', ['Attention', 'test'])
    #notifier.applicationName = 'Listener'
    notifier.register()

def record():
    '''Records Input From Microphone Using PyAudio'''
    duration = 3 #record for 3 seconds. Pretty long duration don't you think
    outfile = "analysis.wav"
    
    p = pyaudio.PyAudio()
    
    inStream = p.open(format=pyaudio.paInt16, channels=1, rate=44100,input=True, frames_per_buffer=1024)

    out = []
    upper_lim = 44100 / 1024 * duration #number of 1024-sample chunks to read for `duration` seconds of audio
    
    for i in xrange(0, upper_lim):
        data = inStream.read(1024)
        out.append(data)
    
    #now the writing section where we write to file
    data = ''.join(out)
    outFile = wave.open(outfile, "wb")
    outFile.setnchannels(1)
    outFile.setsampwidth(p.get_sample_size(pyaudio.paInt16))
    outFile.setframerate(44100)
    outFile.writeframes(data)
    outFile.close()

    inStream.stop_stream()
    inStream.close()
    p.terminate()

    analyze()


def analyze():
    if skype_on_call:
        print "\nSkype Call In Progress"
        print "Listener On Hold"
        return
    inFile = wave.open("analysis.wav", "rb") #open a wav file in read mode
    thresh = 1000  #initial threshold, adapted frame by frame below
    max_samp = 0
    
    decision = [0]

    inactive_counter = 0

    vals = inFile.readframes(inFile.getnframes())  #read every sample in the recording
    results = struct.unpack("%dh" % (inFile.getnframes()), vals)  #unpack to get the samples
    results = [abs(x) for x in results]
    
    #now we need to pull 30 samples at a time (30 samples = 1 frame).

    num_frames = len(results) // 30  #only full 30-sample frames
    for i in xrange(num_frames):
        frame = results[30 * i: 30 * (i + 1)]
        print frame

        #adapt the threshold: decay it slowly toward the previous frame's peak sample
        new_thresh = (thresh * (1 - (2.0 ** -7))) + ((2 ** -8) * max_samp)

        #check how many samples in this frame go above the new threshold
        count = 0
        for j in frame:
            if j > new_thresh:
                count += 1

        if count / 30.0 >= 0.9:   #need it to beat 90%
            #frame is a candidate for speech
            decision.append(1)
        else:
            #counter-based hangover for labelling inactive frames
            if inactive_counter < 10 and decision[-1] == 1: #we ignore silence for 10 frames
                decision.append(1)
                inactive_counter += 1
            else:
                inactive_counter = 0
                decision.append(0)

        #update the threshold and the max sample value
        thresh = new_thresh
        max_samp = max(frame)

    #final check for characterization as speech: we use another counter, since the
    #hangover above can mark stretches of silence as active, and we only count speech
    #when at least 18 consecutive frames are active
    active_counter = 0
    print decision
    final_num = 0
    for val in decision:
        if active_counter >= 18:
            print "Speech!"
            final_num += 1
            active_counter = 0
        if val == 1:
            active_counter += 1
        else:
            active_counter = 0



    #overall intensity: RMS of the samples, in dB
    results = [x ** 2 for x in results]
    intensity = 20 * numpy.log10(numpy.sqrt(sum(results) / inFile.getnframes()))
    
    if final_num >= 3 and intensity > 48:
        if Growl_exists:
            notifier.notify('Attention', 'Listener', 'Speech Detected Nearby')
        else:
            print "Speech Detected Nearby!\nSomeone might be calling you"
    inFile.close()

if __name__ == "__main__":
    #skype_Status holds the current Skype call status (see the AppleScript note below)
    f = open("skype_Status", "r")
    for new_line in f:
        if new_line.strip() == "PROGRESS":
            skype_on_call = True
    f.close()

    if skype_on_call:
        analyze()
    else:
        record()

Anyway, it would be really convenient if I could find some good material on VAD algorithms and improve Listener to work better in my dorm-room setting. It is doing a pretty good job already, but there is always scope for improvement.

As always, my solutions have to be convoluted: I use AppleScript to check whether a Skype call is in progress or not, and you can find all that here.

Screenshots etc available on Listener's new home: http://shriphani.com/blog/listener/.
