We are currently working on a voice processing application in which we want to perform speaker recognition on the audio files that users upload to our backend. Users can upload voice recordings from our iOS / Android mobile app. After receiving the audio, the backend should compare it against the user's existing voice samples and determine whether the user's voice is present in the newly uploaded file. Here is the flow:
1. The user creates a profile by entering an email / mobile number and a password, and uploads 4 sample voice files.
2. Each sample file can be 10-15 seconds long. These samples are used for comparison when the user later uploads voice recordings.
3. When the user finishes signup, the application takes the user to the home screen.
4. The user records new speech in the application and uploads it to the server.
5. The server receives the file and validates it.
6. After validation, the server verifies whether the user's voice is present in this audio file by comparing it against the sample audio files the user uploaded at signup.
7. If the user's voice is identified in the audio file, we record in the database that the user's voice was found in it, then upload the audio file to AWS S3 and send the response back to the mobile app.
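To make steps 5 to 7 concrete, here is a rough sketch of how we picture the server-side handler. Every name in it (`handle_upload`, `validate_upload`, `verify_speaker`, `save_flag`, `store_file`) is hypothetical, the allowed formats and the size limit are assumptions, and the actual speaker comparison, database write and S3 upload are passed in as plain callables rather than implemented. It also assumes the file is stored only when the user's voice is found.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumed limits, not part of our actual requirements.
ALLOWED_EXTENSIONS = (".wav", ".m4a", ".mp3")
MAX_UPLOAD_BYTES = 10 * 1024 * 1024


@dataclass
class UploadResult:
    valid: bool        # step 5: did the file pass validation?
    voice_found: bool  # step 6: was the user's voice detected?
    stored: bool       # step 7: was the file recorded and stored?


def validate_upload(filename: str, data: bytes) -> bool:
    """Step 5: basic checks on extension and size."""
    has_ok_extension = filename.lower().endswith(ALLOWED_EXTENSIONS)
    return has_ok_extension and 0 < len(data) <= MAX_UPLOAD_BYTES


def handle_upload(
    filename: str,
    data: bytes,
    enrolled_samples: Sequence[bytes],
    verify_speaker: Callable[[bytes, Sequence[bytes]], bool],
    save_flag: Callable[[str, bool], None],
    store_file: Callable[[str, bytes], None],
) -> UploadResult:
    """Steps 5 to 7 of the flow, with the heavy parts injected as callables."""
    if not validate_upload(filename, data):
        return UploadResult(valid=False, voice_found=False, stored=False)

    # Step 6: compare the upload against the samples from signup.
    voice_found = verify_speaker(data, enrolled_samples)
    if not voice_found:
        return UploadResult(valid=True, voice_found=False, stored=False)

    # Step 7: record the match (database), then store the file (e.g. S3).
    save_flag(filename, True)
    store_file(filename, data)
    return UploadResult(valid=True, voice_found=True, stored=True)
```

The returned `UploadResult` is what would be turned into the response for the mobile app.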
All registered users of our application should be able to upload their audio files, and our backend should perform speaker recognition as described above. We expect at least 80% accuracy when identifying a user's voice. We tried the Speaker Recognition API provided by Azure, but the accuracy was really bad. We also tried the Bob Bio Spear library. It works fine with its predefined sample audio files, but not with our own audio files.
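For clarity, this is roughly how we would measure the 80% figure: run the recognizer over a set of labelled recordings and count how many yes/no decisions match the labels. The function names and the example data below are made up for illustration.

```python
from typing import Sequence

TARGET_ACCURACY = 0.8  # the "at least 80%" target mentioned above


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of decisions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("need at least one labelled recording")
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)


def meets_target(predictions: Sequence[bool], labels: Sequence[bool]) -> bool:
    """True when the measured accuracy reaches the target."""
    return accuracy(predictions, labels) >= TARGET_ACCURACY


# Made-up example: 5 recordings, one wrong decision.
example_predictions = [True, False, True, True, False]
example_labels = [True, False, False, True, False]
example_accuracy = accuracy(example_predictions, example_labels)
```

Accuracy alone hides the split between false accepts and false rejects, so in practice we would probably track those two counts separately as well.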
This requirement may look similar to Shazam, but it is not. In Shazam, the recorded music is matched exactly against the songs in their database, so the voice and music must be exactly the same as what they have in storage. In our case, we want to compare the user's voice. When a sample is recorded, the user can speak any sentence / text. The audio file uploaded later can contain any kind of speech, and it must still be compared against the sample voice. There may also be background noise. We need to compare only the user's voice, not what he says.
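One common way to compare voices independently of the spoken text is to turn each recording into a fixed-length speaker embedding and compare embeddings with cosine similarity. The sketch below shows only the scoring step, under the assumption that some model (not shown, and not something we have working yet) already produces the embedding vectors. The function names and the 0.7 threshold are placeholders that would need tuning on real data.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as "no similarity"
    return dot / (norm_a * norm_b)


def speaker_score(upload: Vector, enrolled: Sequence[Vector]) -> float:
    """Average similarity between the upload and the signup samples."""
    if not enrolled:
        raise ValueError("need at least one enrolled sample")
    scores = [cosine_similarity(upload, sample) for sample in enrolled]
    return sum(scores) / len(scores)


def is_same_speaker(
    upload: Vector, enrolled: Sequence[Vector], threshold: float = 0.7
) -> bool:
    """Accept when the average score reaches the (assumed) threshold."""
    return speaker_score(upload, enrolled) >= threshold
```

Averaging over the 4 signup samples is just one option. Taking the maximum score, or averaging the enrolled embeddings first, are equally plausible choices.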
If you guys have any suggestions on this, please reply to this thread. If you are willing to work on this as a freelancer, please drop an email to [hidden email]. Thanks for your time.