Sunday, May 5, 2013

presentation


Presentation demo.
It recognizes the first phoneme of a spoken name.
It demonstrates recognition of a name when the speaker is in range, and non-recognition when the name doesn't match or the speaker is out of range.

Currently working on:
final report (sorry, writing always takes me a while).
finishing up log-sampling. The problem is more complicated than I thought: I realized I don't actually know how to "graph" in MATLAB. Writing my own "spectrogram" function means sorting the buckets into a different (log-spaced) distribution, and I'm slowly learning the plotting controls due to a lack of documentation. I'll work on it after the final report.
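
For reference, a minimal sketch of the bucket-sorting half of that spectrogram function, written in Java since the MATLAB side is still giving me trouble (naive DFT for clarity; all the names here are just placeholders):

```java
// Sketch: bin one frame's magnitude spectrum into log-spaced frequency buckets.
// Naive O(n^2) DFT for clarity; assumes a mono frame at a known sample rate.
public class LogBuckets {
    // magnitude spectrum of one frame (first half of the DFT)
    static double[] magnitudes(double[] frame) {
        int n = frame.length;
        double[] mag = new double[n / 2];
        for (int k = 0; k < n / 2; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double ang = 2 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(ang);
                im -= frame[t] * Math.sin(ang);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    // sum spectrum bins into numBuckets log-spaced buckets between fMin and fMax
    static double[] logBuckets(double[] mag, double sampleRate, int frameLen,
                               double fMin, double fMax, int numBuckets) {
        double[] buckets = new double[numBuckets];
        double logMin = Math.log(fMin), logMax = Math.log(fMax);
        for (int k = 1; k < mag.length; k++) {
            double freq = k * sampleRate / frameLen;   // bin center in Hz
            if (freq < fMin || freq > fMax) continue;
            int b = (int) ((Math.log(freq) - logMin) / (logMax - logMin) * numBuckets);
            if (b >= numBuckets) b = numBuckets - 1;   // clamp the top edge
            buckets[b] += mag[k];
        }
        return buckets;
    }
}
```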

Tuesday, April 16, 2013

following this example:

http://www.mathworks.com/matlabcentral/fileexchange/1553-spectrogram-short-time-ft-log-magnitude

I've been playing around with different sampling schemes.

Here's a derpy graph XD

Wednesday, April 10, 2013

Monday, April 8, 2013

beta video.

Realized I forgot to blog last week. I'll make up for it once I get some sleep. C:


take 1


take 2

Friday, March 29, 2013

Reading and Researching

Caught up on a lot of the readings.

I learned what dynamic time warping is.
I also learned contradictory things about sampling for sound.
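
Since DTW will probably come back at the perception step, here's a toy Java version so I don't forget the idea (entirely my own sketch, not taken from any of the readings):

```java
// Toy dynamic time warping: DTW distance between two 1-D sequences.
// dp[i][j] = cheapest cost of aligning a[0..i) with b[0..j).
public class Dtw {
    static double distance(double[] a, double[] b) {
        int n = a.length, m = b.length;
        double[][] dp = new double[n + 1][m + 1];
        for (double[] row : dp) java.util.Arrays.fill(row, Double.POSITIVE_INFINITY);
        dp[0][0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double cost = Math.abs(a[i - 1] - b[j - 1]);
                dp[i][j] = cost + Math.min(dp[i - 1][j - 1],
                                  Math.min(dp[i - 1][j], dp[i][j - 1]));
            }
        }
        return dp[n][m];
    }

    public static void main(String[] args) {
        double[] a = {0, 1, 2, 1, 0};
        double[] b = {0, 0, 1, 2, 2, 1, 0};     // same shape, stretched in time
        System.out.println(distance(a, b));     // small despite unequal lengths
    }
}
```

The point is that the two sequences can have different lengths and different local speeds, which seems exactly like the problem of matching a spoken name against a stored template.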


Started coding the logarithmic sampling, but decided a better use of time might be getting a good demo ready for next week (my beta review is April 8th). So I've put the sampling code aside for now, and am working on getting the perceiver to recognize its name.

After finishing that, I'll work on integrating the Microsoft Speech SDK into this whole thing.

If I have more time, then I'll go back and work on logarithmic sampling.

Thursday, March 21, 2013

HCA Tree accomplished

The HCA tree is a weird form of Huffman encoding.
The parent's weight is the average of the children's weights.

Will look at how to sample over the weekend.

Before I sleep today

Before I sleep today I will produce the HCA tree. I don't want to spend any more time being confused about which two nodes should be next to each other.

I'll be coding in Java; the algorithm will be similar to Huffman encoding.

pseudocode:
given the confusion matrix, create a node for each phoneme. Use a greedy approach (an approximation) to merge the most similar pair first, then proceed from there.

I don't particularly know if this'll work, but I'll start from there. A rough sketch of what I mean is below.
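
A minimal Java sketch of that greedy merge, assuming similarity is read off a confusion matrix and the parent's weight is the average of its children's (all the names here are mine, not from any actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Greedy, Huffman-like construction of an HCA tree from a confusion matrix.
// Repeatedly merge the two most-confusable clusters; a parent's weight is the
// average of its children's weights (unlike Huffman's sum).
public class HcaTree {
    static class Node {
        String label; double weight; Node left, right;
        List<Integer> members = new ArrayList<>(); // phoneme indices under this node
        Node(String label, double weight, int index) {
            this.label = label; this.weight = weight; members.add(index);
        }
        Node(Node l, Node r) {
            left = l; right = r;
            label = "(" + l.label + "," + r.label + ")";
            weight = (l.weight + r.weight) / 2;    // average, not sum
            members.addAll(l.members); members.addAll(r.members);
        }
    }

    // average confusion between all phoneme pairs across two clusters
    static double similarity(Node a, Node b, double[][] confusion) {
        double sum = 0; int count = 0;
        for (int i : a.members) for (int j : b.members) {
            sum += confusion[i][j] + confusion[j][i]; count += 2;
        }
        return sum / count;
    }

    static Node build(String[] phonemes, double[] weights, double[][] confusion) {
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < phonemes.length; i++)
            nodes.add(new Node(phonemes[i], weights[i], i));
        while (nodes.size() > 1) {
            int bi = 0, bj = 1; double best = -1;
            for (int i = 0; i < nodes.size(); i++)
                for (int j = i + 1; j < nodes.size(); j++) {
                    double s = similarity(nodes.get(i), nodes.get(j), confusion);
                    if (s > best) { best = s; bi = i; bj = j; }
                }
            Node merged = new Node(nodes.get(bi), nodes.get(bj));
            nodes.remove(bj); nodes.remove(bi);    // remove the later index first
            nodes.add(merged);
        }
        return nodes.get(0);
    }
}
```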

Wednesday, March 20, 2013

Working on binary tree

I can't figure out how to construct it.
Should 'p' and 'b' be closer, or 'p' and 's'?

The distance between two nodes should be the distance between the values they represent, but once things are in a binary tree, I don't really know how to keep the distances the same as the ones from the paper.

looking at this link: http://ww4.aievolution.com/hbm1201/index.cfm?do=abs.viewAbs&abs=6470
It seems that there is a way of computing something called a "similarity tree" from the similarity matrix.

This study uses this paper for constructing the trees:
http://www.sciencedirect.com/science/article/pii/S0304397511007158
 

Friday, March 15, 2013

Working in the lab

Set up Unity with the project.

At first it wouldn't compile. It took me until Thursday morning to get the whole thing to compile; I had to edit out all the GPU dependencies from the code.

Now it is Friday morning. I just ran some propagations.

This is "t"

[0.01889388]: (consonant_t@consonant_t:494.1513), (17_consonant_j_y.png:798.9461), (38_vowel_e_bird.png:800.9612), (07_consonant_tf_ch.png:918.0318), (08_consonant_h.png:932.5713), (37_dipthon_au_out.png:1202.313), (03_consonant_f.png:1400.457), ....
this is the vowel from "bet"

[0.01889388]: (vowel_3_bet@vowel_3_bet:139.6159), (26_dipthon_ei_eight.png:181.4589), (06_consonant_f_sh.png:189.3021), (33_vowel_u_foot.png:193.5665), (22_consonant_r.png:225.2484), (20_consonant_ng.png:296.1697), (16_consonant_d3_ju.png:301.0428), (15_consonant_3_jzu.png:310.3925), (03_consonant_f.png:318.9848), (23_consonant_w.png:427.7459),
 
 
Ah ha, I have a lot more to work on it seems. 
 
Oh, I didn't finish editing the paper, and I don't think I'm going to until next week. Right now I think it's more productive to keep hacking at this code than to write the paper, especially because I'm extremely slow at papers.

Monday, March 11, 2013

Monday

Read through Pengfei's email and took a look at the Unity code again.
I think this step will be more complicated than I originally thought.

Constructing the hierarchy according to the paper "An Online Algorithm for Hierarchical Phoneme Classification". Sent Pengfei another question about how to connect this data, the MATLAB output, and the Unity code together.

Will ask Mubassir about getting a computer in the lab. When I work on my laptop, the Unity code is way too slow; Unity crashed on me about 5 times today.

my tree on paper right now

Sunday, March 10, 2013

Back from spring break

Got off the plane a few hours ago.
This is my plan for this week:

Monday: Pengfei replied to my email about Unity. I will use the reply to take another look at the code base, and put my data through the Unity code.
Tuesday: finish "Monday", and fix the proposal according to comments. (It takes me a long time to write.)
Wednesday: Read some papers/research. Form a battle plan for how to extract formants/other data (the MATLAB portion).
Thursday: Continue Wednesday's work. Start coding if possible.
Friday: catch up, or more coding.

There's a Math midterm next Tuesday (the 19th). From Saturday until the midterm, I will not be very productive with this project.

Thursday, February 28, 2013

Unity + Derp = Derpity

I was on schedule to run the whole pipeline once without any algorithm changes, and then I realized I have no idea how to use Unity (haha). I finished generating the data file from the SPR portion of the pipeline while I was sleeping last night, and this morning I woke up ready to stuff it into Unity.

Behold, I open up the project file, and I have no idea what any of the panels do. So I watched a bunch of tutorials. (like this one: http://www.youtube.com/watch?feature=player_embedded&v=hbjB80-Mc7E ).

Very helpful. Now I am reading the code and trying to figure out how to specify which sound to propagate. Maybe Pengfei will be in the lab tomorrow (Thursday); he was busy today >x< ...

The code currently does something weird in "initmap" of soundpropogator. I have a vague idea that somewhere I should be replacing the dataset, but I haven't found where that lives yet.

I think the code is fine in the init steps, however. The receiver is outputting a signal (I can't tell which signal, but there is definitely a signal).

I'll take another look tomorrow.

Tuesday, February 26, 2013

Error using ==> spectrogram


I was getting this error for the longest time... It turns out it's not the MATLAB code's fault, it's my test files' fault: the sound files need to be in mono instead of stereo. haha
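
So I don't trip over this again, a rough Java sketch of the stereo-to-mono fix (assumes 16-bit little-endian PCM WAV, which is what my test files are; uses javax.sound.sampled):

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import javax.sound.sampled.*;

// Average the left and right channels of a 16-bit PCM stereo WAV into mono,
// so the spectrogram code stops choking on the input.
public class StereoToMono {
    public static void main(String[] args) throws Exception {
        AudioInputStream in = AudioSystem.getAudioInputStream(new File(args[0]));
        AudioFormat f = in.getFormat();          // assumed: 2ch, 16-bit, little-endian
        byte[] stereo = in.readAllBytes();
        byte[] mono = new byte[stereo.length / 2];
        for (int i = 0; i + 3 < stereo.length; i += 4) {
            int left  = (short) ((stereo[i + 1] << 8) | (stereo[i]     & 0xff));
            int right = (short) ((stereo[i + 3] << 8) | (stereo[i + 2] & 0xff));
            int avg = (left + right) / 2;        // mix down by averaging
            mono[i / 2]     = (byte) (avg & 0xff);        // low byte
            mono[i / 2 + 1] = (byte) ((avg >> 8) & 0xff); // high byte
        }
        AudioFormat monoFmt = new AudioFormat(f.getSampleRate(), 16, 1, true, false);
        AudioInputStream out = new AudioInputStream(
                new ByteArrayInputStream(mono), monoFmt, mono.length / 2);
        AudioSystem.write(out, AudioFileFormat.Type.WAVE, new File(args[1]));
    }
}
```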

Here are some results. They look better than the last set of data I was using C:
Now I need to do the hand-editing part, where I stick all these files through Photoshop.



A comparison of old data and new data


Above: new 'b'
Below: old 'b'




====================================================================





Above: 'a' sound, a vowel.
Below: 'p' sound, a consonant.




Above: 's' sound. A consonant: fricative, voiceless.
Below: "ou" sound as in "out". A diphthong.

Monday, February 25, 2013

Met with Pengfei today

Pengfei showed me the whole pipeline today, and was very helpful with the few questions I had.
I realized I had actually understood the pipeline wrong.

TODO for alpha: run through the whole pipeline once with my data. I hope I can get through the whole pipeline in time.

1) generate images for sound extraction (done)
2) hand edit images (current)
3) run images through comparison code
4) propagate
5) compare

Sound sources I'm using now:
 http://www.teachingenglish.org.uk/activities/phonemic-chart
http://manual.audacityteam.org/man/Tutorial_-_Recording_audio_playing_on_the_computer

I really liked the sound sources from the first link. So, using the tutorial on recording within the computer, I am creating a new library of playable sounds.

Friday, February 22, 2013

sound files + bleh consonants

Made a matrix and downloaded some sound files (sources below). Pengfei's MATLAB code now saves graphs automatically, because I got too lazy to save them by hand C: (tehe I love MATLAB)

http://beta.freesound.org/people/janmario/downloaded_packs/
http://www.phonetics.ucla.edu/course/chapter1/chapter1.html

I think I need to record my own sound files. The sound files I downloaded have too much vowel in them. The image below is the 'p' sound (as in "lip"); the sound file I downloaded is really "pa". This makes sense, because a consonant is aperiodic, and thus you can't really pronounce the consonant without the vowel. In terms of the graph below, this means the "p" sound is really the vertical blue lines at the beginning, and the "a" is the red part. I don't think Pengfei's code (as of right now) can deal with the level of detail at which consonants need to be evaluated.




At some point (after figuring out the consonants), we need to categorize the different consonants so we can create an HCA tree.


And because these images are so pretty C:

Above: the vowel in "hot"
Below: the 'm' in "am"


And again, the "hot" sound is fine because it is a vowel. the "m" sound is really "ma" and you can see what we need to take out is just the blue stripe in the beginning. The rest of the data is really an unnecessary "a" sound.


Actually, looking at all these graphs reminds me of this art project one of my teachers showed me. Spoken word is actually really weird and chaotic. Breaking speech down into phonemes works to a certain extent, but it is actually a pretty bad way of reproducing speech. The artist in the video recorded himself speaking all the different phonemes, and tried to speak by pasting the phonemes together. (You can clearly see that it doesn't work well.)



Sunday, February 17, 2013

Confusion matrix

http://pubman.mpdl.mpg.de/pubman/item/escidoc:67125:6/component/escidoc:67126/Consonant+And+Vowel+Confusion+Patterns+By.pdf

http://pubman.mpdl.mpg.de/pubman/item/escidoc:60592:2/component/escidoc:60593/Cutler_2004_patterns.pdf

http://people.cs.uchicago.edu/~dinoj/research/confmat.html

The one problem that I have with consonants is that any place I download them from, they are flanked by a vowel of some sort (because consonants are more easily recognized when they have vowels attached). But I'm pretty sure that for this program we want each consonant by itself. Maybe I should just record them... ah ha

Friday, February 15, 2013

Meeting notes 2/15/2013

keep on revising the document throughout the project's duration.

I will get the MATLAB code this weekend, and use it to generate packets: input the sound; the output is a flat file that gets read into Unity.

0th step: use Pengfei's code as-is to propagate.
Use the Unity code to get the distorted packet.
See if the simple MATLAB code is enough.

The MATLAB code needs a confusion matrix. Idea: the discretization in our code should be the same as the one in real life.

Project Proposal C:

Just sent out the revised proposal. It took me a while (somehow writing isn't the easiest thing for me).
The reference material section hasn't really been revised, but I do have a list of the sources I am using.

From my todo list, I didn't play with MATLAB. I did play with Pengfei's code for about 30 minutes; I don't think I was productive for those 30 minutes. I will try again in the coming week when I have more time.

Oh, I'm not sure what the SVN access is for... I should ask about that.

Obligatory post of the week?

Rewriting my project proposal as I am typing this; I will post it when I'm done. It seems that SPREAD works very, very differently from what I originally thought when I first read the paper. I'm actually a bit worried that my idea to distinguish between sounds is impossible. In fact, I'm almost sure that doing this is impossible given SPREAD's current data structure. At the same time, this makes me wonder what I can add to SPREAD so that the propagation of speech sounds becomes doable.

I just sent an email to someone in the linguistics department asking to meet. I hope I can spend maybe an hour with them, and that they can tell me the minimum set of data needed for a person to distinguish phonemes from each other.

======================================

I didn't do much this week because I was cramming for my STAT exam from Monday on, and over the weekend I went home for Chinese New Year. I'll make up the time next week.

I made a repository for Pengfei's code on my private GitHub account. I've vaguely looked at it; the fact that it is C# is a bit scary, but I'll trust Pengfei when he says it is easy to use.

I did some more reading between bouts of STAT, and I have a vague plan for the pre-processing part. Currently:

Given a sound file, the MATLAB code breaks the sound signal into tiny chunks and does Fourier analysis on each one. For each time chunk, the code basically takes the Fourier component with the largest coefficient and propagates that (which is where the term "SPREAD" comes from: the one frequency will become a spectrum of frequencies as it travels through space).

I know that to distinguish between vowels, the information of at least 3 formants is needed. So a minimum of 3 packets is needed for each chunk of time. Once I get the MATLAB code, I am planning to see how I could add to it to gather the data for 3 formants.
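
A minimal sketch of the extension I'm picturing (this is not Pengfei's actual code, and the names are made up): given the magnitude spectrum of one time chunk, keep the three largest local peaks instead of the single largest coefficient.

```java
// Sketch: per time chunk, return the frequencies of the 3 largest local
// peaks in the magnitude spectrum, instead of just the single biggest bin.
public class FormantPeaks {
    static double[] top3PeakFreqs(double[] mag, double sampleRate, int chunkLen) {
        double[] peakMag = new double[3];
        double[] peakFreq = new double[3];
        for (int k = 1; k < mag.length - 1; k++) {
            if (mag[k] <= mag[k - 1] || mag[k] <= mag[k + 1]) continue; // not a local max
            for (int p = 0; p < 3; p++) {
                if (mag[k] > peakMag[p]) {
                    for (int q = 2; q > p; q--) {   // shift smaller peaks down
                        peakMag[q] = peakMag[q - 1];
                        peakFreq[q] = peakFreq[q - 1];
                    }
                    peakMag[p] = mag[k];
                    peakFreq[p] = k * sampleRate / chunkLen; // bin index -> Hz
                    break;
                }
            }
        }
        return peakFreq;    // rough stand-ins for the first 3 formants
    }
}
```

Real formant extraction is surely fancier than "three biggest spectral peaks", but three packets per chunk is the shape of the data I'm after.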

===========================================

My alpha is on the 28th at 3:15. I am aiming for some MATLAB code and initial tests.


Tuesday, February 12, 2013

TODO this week

I have a STAT midterm this week, but I am planning the following by Friday:

- rewrite of proposal (I can't really 100% focus on this until Thursday night, after my exam)
- make a vague "plan of attack" (aka: brainstorm ways to approach the problem)
- play with Pengfei's code a bit
- play with MATLAB (a bit of a stretch)

Friday, February 8, 2013

2/8/13 Meeting notes



phonemes --> SPR (sound packet representation) --> propagation step (as is) --> {p}' (distorted sound packets) --> perception

phoneme --> SPR
discretization step --> want to minimize the mismatch: choose a particular sound packet representation such that the confusion matrix of that representation is most similar to the perceived confusion matrix.

perception:
 a) filtering
 b) DTW (dynamic time warping) as a similarity measure --> HMM

(need some kind of confidence)

MATLAB, optimization: min || C_spr - C_perception ||

task 1: don't do any extra work at all, stuff it through and see how it goes (which packets survive at the listener)
task 2: apply a filter at the SPREAD end


Let's assume HMM is solved.

play with sampling and etcetcetc...

Thursday, February 7, 2013

Slides

I keep on forgetting to update this blog.

https://docs.google.com/presentation/d/1eFjVly88QdqGwIjtdX-USomIEKv5i2eYybdmKPDZl14/edit?usp=sharing
I have slides for the presentation tomorrow. I really don't think I'm qualified to give this presentation. I debated how much technical information to include, and I don't think there's that much. All the technical information I vaguely learned this week, I feel, will be uninteresting to the audience. Plus, most of the things I read, I feel that I don't actually understand (most of it I'm reading just to know that it exists somewhere). At some point I lost track of the sources I've been reading. For everything I read, I have to look up half the material (and in that looked-up material, I'm looking up more material...).

A concern for the project, however: it seems that my project is really about how to create an HCA from a confusion matrix. However, this happens after propagation (which has problems of its own).
1 - I don't even know how propagation will turn out (from my readings, it seems that Fourier coefficients may not be the best features to use for speech recognition).
2 - There's nothing that says an HCA will work on speech sounds. After propagation, if the confusion matrix is very far from the identity matrix (aka close to a matrix of random numbers), then I don't think I can create an HCA tree for the data.



Monday, February 4, 2013

Friday Meeting Notes

Time-domain propagation (with frequency).
1 kHz signal --> scale any frequency into the 1 Hz - 1 kHz range.

if the signal is represented at 10 kHz, divide all frequencies by 10.

pretend all frequencies are divided by, say, 5 --> stretches out the sound waves.

SPREAD: 50 Hz (64 Hz), then powers of 2 --> 5 bands... etc etc lala

telephone: 300 Hz --> 3 kHz


So the main idea is: stretch the signal, propagate, shrink it back to scale.
--> simple experiments
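
A toy version of the stretch idea, for reference (my own sketch, nothing from the meeting): stretching a waveform by a factor k in time divides every frequency in it by k.

```java
// Toy version of "stretch the signal, propagate, shrink it back to scale":
// stretching a waveform by factor k in time divides all its frequencies by k.
public class Stretch {
    static double[] stretch(double[] signal, double k) {
        int outLen = (int) (signal.length * k);
        double[] out = new double[outLen];
        for (int i = 0; i < outLen; i++) {
            double src = i / k;                    // position in the original
            int lo = (int) src;
            int hi = Math.min(lo + 1, signal.length - 1);
            double frac = src - lo;
            out[i] = (1 - frac) * signal[lo] + frac * signal[hi]; // linear interp
        }
        return out;
    }

    public static void main(String[] args) {
        // a 1 kHz tone at 44.1 kHz, stretched 10x, behaves like a 100 Hz tone
        double[] tone = new double[4410];
        for (int t = 0; t < tone.length; t++)
            tone[t] = Math.sin(2 * Math.PI * 1000 * t / 44100.0);
        double[] slow = stretch(tone, 10.0);
        System.out.println(tone.length + " -> " + slow.length);
    }
}
```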

ambient sound.

--> iset --> input speech signal and output phonemes



given phonemes:
how do phonemes degrade? --> natural degradation
--> phonemes ordered by user evaluation of any pair's similarity
--> clustered together

phonemes confusion matrix <-- HCA tree @U@

tehe

implementation --> HCA tree to phonemes
--> confusion tree


problem statement: propagate speech for
proposed solution:
1 - extend the sound packet representation
2 - propagate that
3 - perception side: construct the HAC

experimental target:
1 - Assume the agent will respond to its name.
2 - Will the agent respond to the name after degradation?
3 - Let one agent call another by name.

Norm wants the cocktail party effect. C:

input: look into existing methods of detecting phonemes in speech
middle: modify the SPREAD packet; HCA tree; confusion matrix
then: give each agent a name, and a matching algorithm
(lip-sync software)
end:

- need a database of 40 phonemes

phonemes paper --> speech signal degradation / voice quality degradation / computational representation of speech

alpha --> pipeline

Thursday, January 31, 2013

Re-reading the original paper.

Currently still reading Computational Methods in Acoustics. Printed the second half onto paper yesterday. I think this is something that I need to draw on while I read... It's like a math textbook, and I feel that I probably only need to read it to get the big picture, and to know where to refer back if I ever need it later.

Re-reading the SPREAD paper after skimming some of the other readings made the paper more understandable. This time, I actually have a vague idea of what HAC is, and of what the other keywords in the paper are. However, understanding the paper a little more also makes my project proposal a lot more intimidating. (ah ha)

For example: human speech and recognition. From what I understand, SPREAD's recognition happens through HAC of about 100 different environment sounds. Would I be running HAC on the 42 different phonemes of English? Also, from what I know of consonants and vowels... you can't really propagate consonants, because many times they are just inferred during perception...

TODO: what is TLM... 

Monday, January 28, 2013

Todo list this week

Read a lot of things, and produce a presentation on Friday: a 5 - 10 minute oral presentation,
and a 20-minute presentation on either item 4 or 5 below.
 
1 - Pengfei's sound paper (done)
2 - Music and Computers web textbook (done)
3 - Computational Methods in Acoustics (next)
4 - Interactive Physically-Based Sound Simulation (seen before, but not really read)
5 - Zheng's thesis

I've seen Interactive Physically-Based Sound Simulation before, for a sound-generation project I did in 563, but I've never really read the thesis. I actually don't know how to read thesis papers; they are 100+ pages long... I've just Ctrl+F'd the keyword I'm looking for and read that particular section, but I'm not sure if that is the correct way to read a thesis.

Thursday, January 24, 2013

Blah. Just realized the full proposal is due tomorrow

Meetings are Fridays 2 - 3 from now on. Do a 15-minute presentation by next Friday... Just transferred all the content from the Tumblr blog to this one. I need to pump out a full project proposal for tomorrow. (Ah ha, I have no idea what I'm writing about.)

Example of speech represented via sine waves

Around halfway down the chapter 4.2 page of "Music and Computers" there is an example of speech done with sine waves. It is a bit concerning for me, because I'm guessing the system I'll be using will use sine waves, and speech sounds may end up sounding like this.

Music and Computers

One of the best resources on music, computers, and sampling rates (not enough math for my taste, but it is very understandable). I'm on chapter 4. C: I've been doing this instead of my 3D-scanner work like I should.

I think I need some time to get used to the ideas of bandwidth and sampling rate. (I don't know how to describe it; there's this extra dimension that I'm not used to when dealing with signals... a little bit like nested for-loops when you're first learning to program.) You have the sampling rate (which gets the amplitude at each time step), and you have bits per sample, and I'm missing the idea of how the two are related. I know Nyquist-Shannon says you need to sample at a rate 2 times the highest frequency to avoid aliasing. So maybe the sampling rate and the number of buckets (which is related to bits per sample) are related. Apart from that, I'll be doing 3D-scanning stuff and maybe reading the rest of this link this weekend at PennApps.
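
To convince myself about the Nyquist-Shannon part, a tiny experiment (my own toy code, nothing from the book): sample a tone above half the sampling rate and watch it fold down to a lower frequency.

```java
// Toy aliasing check: a tone above the Nyquist frequency (sampleRate / 2)
// folds down and becomes indistinguishable from a lower tone.
public class Aliasing {
    public static void main(String[] args) {
        double sampleRate = 8000;            // 8 kHz, so Nyquist is 4 kHz
        double fHigh = 6000;                 // above Nyquist
        double fAlias = sampleRate - fHigh;  // folds down to 2 kHz
        double maxDiff = 0;
        for (int n = 0; n < 8000; n++) {
            double t = n / sampleRate;
            double a = Math.cos(2 * Math.PI * fHigh * t);
            double b = Math.cos(2 * Math.PI * fAlias * t);
            maxDiff = Math.max(maxDiff, Math.abs(a - b));
        }
        // prints ~0: sampled this way, the 6 kHz and 2 kHz tones are identical
        System.out.println("max sample difference: " + maxDiff);
    }
}
```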

Project Proposal

Last semester I was planning to do Senior Design on a 3D-scanning project. This semester, I decided that since I really like sound, I should do what I like. I met with RoboCup and Professor Badler (+Pengfei). I think I am going with the graphics project over the robotics project, because I don't know that I have enough background in hardware to deal with the numerous hardware failures in robotics. (Pasting my 1-page project brief below.)

Jiali Sheng, Project Brief:

I've always been interested in sound, and I decided to change my project to something relating to sound this semester. I've talked with Professor Badler and Pengfei, became interested in "SPREAD", and would like to work on how sound distorts over distance in the system. More specifically, how speech (English) distorts over distance. I would first test the system with the 40 phonemes of English and evaluate the cohesiveness of these phonemes as they distort in the simulation over space. The prediction is that because different frequencies of sound distort differently in SPREAD, the level of distortion will be different for different phonemes. As a result, speech will sound strange (at best), or incoherent, through the system. My project will be focused on making English speech more coherent in the system.

Without really knowing how SPREAD works, I am unable to give a concrete plan for the actual algorithm. However, I do know that similar problems have been looked at by phone companies, because higher frequencies distort more than lower frequencies. To solve this, they sample voice at at least 8 kHz. Voice transcription services have also dealt with similar problems; they employ a guessing system where, if one syllable isn't clear, they calculate a list of syllables likely to appear there and take a guess.

To me, this problem is a lot more interesting than the problem of scanning better frescoes via a better scanner, and I hope that I will have approval to do this project as my Senior Design. My project blog will likely be located here instead: http://soundgen.tumblr.com/ . This is the blog I used for my physics-based animation final, but because the topic is more similar, I think it'll be a better place.