Latent Semantic Analysis – Part 1

Dear Reader,

All posts have been moved, and this site is no longer functioning. To get the latest updates and follow up on your comments, please visit the new site and subscribe.

Find Latent Semantic Analysis – Part 1  @

Thank you



7 Responses to “Latent Semantic Analysis – Part 1”

  1. ganesh Says:

    awesome dude
    ur work is very much helpful for me…

  2. sudheer Says:

    Can you please clarify? Your assumption and your declaration of rows and columns (and vice versa) are different.

    countMatrix [length of ArrayCollection][ total number of documents]

    Also, I have sent a question to your email. Did you check it? Please reply.

  3. shakthydoss Says:

    CountMatrix [ No.of.row ] [ No.of.column]

    No.of.row is simply the number of words in all documents
    No.of.column is simply the number of documents in the corpus
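
The layout described above can be sketched in a few lines of Python. This is only an illustration, not the code from the post: the function name `build_count_matrix`, the whitespace tokenizer, and the three sample documents are all assumptions made for the example.

```python
from collections import Counter


def build_count_matrix(documents):
    """Return (vocabulary, matrix), where matrix[row][col] is the number of
    times term number `row` occurs in document number `col`."""
    # Assumption: plain lowercase whitespace tokenization, no stemming.
    tokenized = [doc.lower().split() for doc in documents]

    # One row per distinct term, in sorted order so row numbers are stable.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    row_of = {term: i for i, term in enumerate(vocabulary)}

    # Rows = terms, columns = documents, as in CountMatrix[No.of.row][No.of.column].
    matrix = [[0] * len(documents) for _ in vocabulary]
    for col, tokens in enumerate(tokenized):
        for term, count in Counter(tokens).items():
            matrix[row_of[term]][col] = count
    return vocabulary, matrix


# Hypothetical sample corpus, for illustration only.
docs = ["market stock market", "stock price", "market price price"]
vocabulary, count_matrix = build_count_matrix(docs)

for term, row in zip(vocabulary, count_matrix):
    print(term, row)
```

Here `len(count_matrix)` is the number of distinct terms and `len(count_matrix[0])` is the number of documents, so `count_matrix[r][c]` reads as "term r, document c".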

  4. sudheer Says:

    One more question I have for you:

    Does the number of words in all documents (in the previous post) include:

    1) the terms remaining after stop-word removal and stemming (index words), or the tokens of the full document(s)?

    2) Are the resulting index words kept with duplicates in your index file, or are repeated words removed?

    Waiting for your reply

    • shakthydoss Says:

      It is the terms remaining after the stop-word removal and stemming process.

      The index will not have duplicates.
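
A rough sketch of that answer in Python, under stated assumptions: the tiny `STOP_WORDS` set and the `naive_stem` helper are placeholders invented for this example (a real pipeline would use a proper stop-word list and a real stemmer such as Porter's), and `build_index` is a hypothetical name, not the author's code.

```python
# Assumption: a tiny illustrative stop-word list, not a complete one.
STOP_WORDS = {"the", "is", "a", "of", "in", "and"}


def naive_stem(token):
    # Placeholder stemmer: only strips a trailing "s" from longer words.
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def build_index(documents):
    """Collect each distinct stemmed, non-stop-word term once,
    in order of first appearance."""
    seen = set()
    index = []
    for doc in documents:
        for token in doc.lower().split():
            if token in STOP_WORDS:
                continue  # stop-words never reach the index
            term = naive_stem(token)
            if term not in seen:  # repeated words are stored only once
                seen.add(term)
                index.append(term)
    return index


# Hypothetical sample documents, for illustration only.
index_terms = build_index(["The markets in the city", "A market is a market"])
print(index_terms)
```

The length of the resulting index is what would serve as the number of rows in the count matrix.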

  5. TV TUAN Says:

    You use tdidf (not tfifd), so it might work for word similarity, since you can somehow define the tdidf of the query Q for each document by counting the occurrences of all words in Q in that document? (T,F ?) However, if you want to solve doc-similarity, then tdidf will not fix the case. I think we need to normalize term frequency as well, not only doc frequency. How do we define the tfidf for a document query Q?

  6. huangzy Says:

    I have a question:

    In this article, you wrote the following:
    For example, the word “market” (whose annotated value is 2 in row) appears 4 times in a particular document (whose annotated value is 5) in column.

    Shouldn't the 4 be in the 2nd row, 5th column of the count matrix (the 4 in blue is in the 5th row, 2nd column in your picture)? Or am I wrong?

