*************************************************************************
Dear Reader,
All the posts on shakthydoss.wordpress.com have been moved to shakthydoss.com.
shakthydoss.wordpress.com is no longer active. To get the latest updates and follow up on your comments, please visit shakthydoss.com and subscribe.
Find Latent Semantic Analysis – Part 1 @ http://shakthydoss.com/latent-semantic-analysis-part-1/
Thank you
shakthydoss
**************************************************************************
January 31, 2011 at 5:50 pm
Awesome, dude!
Your work is very helpful for me.
March 5, 2011 at 4:45 pm
Hi,
Can you please clarify your assumption? Your description and your declaration of rows and columns (and vice versa) are different:
countMatrix [length of ArrayCollection][ total number of documents]
Also, I have sent a question to your email. Did you check it? Please reply.
March 6, 2011 at 11:55 am
CountMatrix [ No.of.row ] [ No.of.column]
No.of.row is the number of words across all documents.
No.of.column is the number of documents in the corpus.
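A minimal sketch of that layout, assuming Python and a tiny hand-made corpus (the variable names and data are illustrative, not taken from the original post). Rows are index terms and columns are documents, so a cell is read as count_matrix[term_row][document_column]:

```python
# Toy corpus: each document is already tokenized (hypothetical data).
documents = [
    ["market", "stock", "market"],
    ["stock", "price", "trade"],
    ["market", "price", "price"],
]

# Unique terms across all documents; their order defines the rows.
index_terms = sorted({term for doc in documents for term in doc})

# count_matrix[row][column]: one row per index term, one column per document.
count_matrix = [[doc.count(term) for doc in documents] for term in index_terms]

row = index_terms.index("market")
print(len(count_matrix), len(count_matrix[0]))  # number of rows, number of columns
print(count_matrix[row][0])  # occurrences of "market" in the first document
```

With this convention the first index always selects the word and the second always selects the document, which is the orientation the declaration above describes.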
March 15, 2011 at 9:12 am
One more question for you:
Does the number of words in all documents (in the previous post) include:
1) the terms remaining after stop-word removal and stemming (index words), or the full document tokens?
2) Are the resulting index words duplicated in your index file, or are repeated words removed?
Waiting for your reply.
Regards
March 15, 2011 at 10:47 am
It is the terms remaining after stop-word removal and stemming.
The index will not contain duplicates.
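A rough sketch of how such an index could be built, assuming Python. The stop-word list and the naive_stem helper are hypothetical stand-ins (a real system would use a fuller list and a proper stemmer such as Porter's); the point is only that the index keeps each surviving term once:

```python
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "is", "a", "of", "in", "and", "to"}

def naive_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(documents):
    """Return the sorted unique stemmed terms, with stop words removed."""
    terms = set()  # a set keeps each term only once
    for text in documents:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS:
                terms.add(naive_stem(token))
    return sorted(terms)

docs = ["The market crashed", "Markets and the price of stocks"]
print(build_index(docs))
```

Here "market" and "Markets" collapse to a single index entry, so the number of rows in the count matrix equals the number of distinct stemmed terms, not the number of tokens.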
September 21, 2011 at 2:23 pm
You use tdidf (not tfifd), so it might work for word similarity, since you can somehow define the tdidf of the query Q for each document by counting occurrences of all words in Q in that document (T, F?). However, if you want to solve document similarity, then tdidf will not cover that case. I think we need to normalize term frequency too, not only document frequency. How do we define the tfidf for a document query Q?
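One common convention, offered here as an assumption rather than as the method used in the post, is to treat the query Q as a small pseudo-document: compute its length-normalized term frequencies, reuse the corpus idf weights, and compare it to each document by cosine similarity. A minimal Python sketch with made-up data:

```python
import math

def tf_idf_vector(tokens, index_terms, idf):
    """Length-normalized term frequency times idf, one weight per index term."""
    total = len(tokens) or 1
    return [tokens.count(t) / total * idf[t] for t in index_terms]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical tokenized corpus.
documents = [["market", "stock", "market"], ["stock", "price"], ["price", "trade"]]
index_terms = sorted({t for d in documents for t in d})
n_docs = len(documents)

# idf from document frequency: log(N / number of documents containing the term).
idf = {t: math.log(n_docs / sum(1 for d in documents if t in d)) for t in index_terms}

query = ["market", "price"]  # the query Q, treated as a pseudo-document
q_vec = tf_idf_vector(query, index_terms, idf)
scores = [cosine(q_vec, tf_idf_vector(d, index_terms, idf)) for d in documents]
print(scores)
```

Dividing by the token count normalizes term frequency for both the query and the documents, so a long document does not score higher merely because it is long.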
February 22, 2012 at 5:32 pm
I have a question.
In this article, you wrote the following:
For example, the word “market” (whose annotated value is 2 in row) appears 4 times in a particular document (whose annotated value is 5) in column.
Shouldn't the 4 be in the 2nd row, 5th column of the count matrix? (The 4 in blue in your picture is in the 5th row, 2nd column.) Or am I wrong?