Friday, September 17, 2010

Blog questions on document similarity measurement

1. A search engine that my friend wrote tries to return *all* the documents in its database no matter what the query. Comment on its
precision and recall (qualitatively--as in "high" and "low")
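
As a rough illustration of how the two measures in question 1 are computed, here is a minimal Python sketch. The database size, the document names, and the set of relevant documents are all made up for the example, and `precision_recall` is a hypothetical helper rather than part of any real search engine.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating results as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical database of 1000 documents, 3 of which answer the query.
database = [f"doc{i}" for i in range(1000)]
relevant = ["doc3", "doc42", "doc77"]

# The "return everything" engine ignores the query entirely.
p, r = precision_recall(database, relevant)
print(f"precision={p:.3f} recall={r:.3f}")
```

Under these assumed numbers, every relevant document is retrieved, while only 3 of the 1000 returned documents are relevant.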

2. You are trying to find all the movies directed by Woody Allen, and so gave the query "woody allen directed" to your search engine. 
Suppose it returns IMDB records (IMDB is the internet movie database). What is more important in this case--precision or recall?

3. I am considering viewing documents as bags and using the bag similarity measure to compute their similarity. I am considering three different bag representations:

     a. documents as bags of letters (alphabetic characters)
     b. documents as bags of words 
     c. documents as bags of sentences

Comment on the relative precision/recall values offered by the three approaches.
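
To get a feel for how the three representations in question 3 behave, here is a small Python sketch. It assumes that "bag similarity" means the Jaccard measure generalized to bags (sum of minimum counts divided by sum of maximum counts). The two sample documents and the helper names are invented for illustration.

```python
import re
from collections import Counter


def bag_similarity(items_a, items_b):
    """Generalized Jaccard: sum of min counts over sum of max counts."""
    bag_a, bag_b = Counter(items_a), Counter(items_b)
    shared = sum((bag_a & bag_b).values())  # & keeps the minimum count
    total = sum((bag_a | bag_b).values())   # | keeps the maximum count
    return shared / total if total else 1.0


def letters(text):
    return [ch for ch in text.lower() if ch.isalpha()]


def words(text):
    return re.findall(r"[a-z]+", text.lower())


def sentences(text):
    return [s.strip() for s in re.split(r"[.!?]", text.lower()) if s.strip()]


# Two made-up documents with similar vocabulary but different meaning.
doc_a = "Dog bites man. Man is sad."
doc_b = "Man bites dog. Dog is mad."

scores = {
    name: bag_similarity(tokenize(doc_a), tokenize(doc_b))
    for name, tokenize in [("letters", letters), ("words", words), ("sentences", sentences)]
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

For this pair, the letter bags overlap the most, the word bags overlap about half, and the sentence bags do not overlap at all, which hints at how strict each representation is about what counts as a match.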

4. One of the CSE faculty members used to have a bunch of magnetic words stuck to his (then metallic) doors. All the students passing by would try to arrange the magnetic words to make interesting English messages. If we assume that each student was making it a point to use *all* the words in his/her message, then what would the bag-of-words similarity measures say about the pair-wise difference among those messages? Considering that I told you that search engines use these as the default similarity metrics, what does it make you think about the intelligence of search engines?
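
The magnetic-words scenario in question 4 can be tried out with a short Python sketch. The messages below are invented, and the similarity function again assumes the generalized Jaccard measure over bags of words (minimum counts over maximum counts).

```python
from collections import Counter
from itertools import combinations


def bag_of_words_similarity(text_a, text_b):
    """Generalized Jaccard similarity over word bags; word order is ignored."""
    bag_a = Counter(text_a.lower().split())
    bag_b = Counter(text_b.lower().split())
    shared = sum((bag_a & bag_b).values())
    total = sum((bag_a | bag_b).values())
    return shared / total if total else 1.0


# Hypothetical messages, each using all four magnets exactly once.
messages = [
    "students love hard exams",
    "hard exams love students",
    "exams love hard students",
]

pair_scores = {
    (a, b): bag_of_words_similarity(a, b) for a, b in combinations(messages, 2)
}
for (a, b), score in pair_scores.items():
    print(f"{a!r} vs {b!r}: {score}")
```

Because every message is a rearrangement of the same magnets, the bags are identical and the measure cannot tell the messages apart.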

5. Suppose I have a document D1: "Rao is a happy chap"   I make another document D2 by copying all the text from D1 and pasting it 2 times into D2 (so D2 is
"Rao is a happy chap Rao is a happy chap"). What will the bag similarity metric say about the similarity between D1 and D2? What will the cosine-theta (or vector) similarity metric
say about the similarity between D1 and D2? Which do you think is better for this case?
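
For question 5, the two metrics can be compared directly on D1 and D2 with a minimal Python sketch. It assumes the bag similarity is the generalized Jaccard measure (sum of minimum counts over sum of maximum counts) and that the vector model uses raw term counts with no weighting; both function names are hypothetical.

```python
import math
from collections import Counter


def bag_similarity(text_a, text_b):
    """Generalized Jaccard over word bags: min counts over max counts."""
    bag_a, bag_b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    shared = sum((bag_a & bag_b).values())
    total = sum((bag_a | bag_b).values())
    return shared / total if total else 1.0


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between raw term-count vectors."""
    vec_a, vec_b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(count * vec_b[word] for word, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


d1 = "Rao is a happy chap"
d2 = "Rao is a happy chap Rao is a happy chap"

bag_score = bag_similarity(d1, d2)
cos_score = cosine_similarity(d1, d2)
print("bag similarity:", bag_score)
print("cosine similarity:", cos_score)
```

Under these assumptions the bag measure is sensitive to the doubled word counts, whereas the cosine measure depends only on the direction of the count vector, which copying the text does not change.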



  1. 1. This search would have very low (0) precision, and very high (1) recall.

    2. In this case, precision would be more important. If the results have too high a recall, you would end up with results including movies Woody Allen was in but didn't direct, or movies that were directed, but not by Woody Allen. Neither of those two cases is what the user intended to find.

    a. Letters: This would have much too high recall, and much too little precision. For most searches, this would be useless because it would return way too many results.
    b. Words: This is right in the middle, and is considered a pretty good method.
    c. Sentences: This may be useful in some cases, but would have very low recall, and very high precision. It would require that two whole sentences be exact to give a match.

    4. This would render our search useless at the word or letter level. Searching for any given word would return all the student's combinations. This shows that search engines really aren't that smart.

    5. In this case, a bag result would show 0 difference between the two, while a vector result would show that D2 has "twice" the similarity to a given search.

    I thought this video was interesting. It's Douglas Merrill discussing how Google does the "Did you mean?" stuff in their search. It's called statistical machine learning. It's 60 minutes long, but if you watch for 2 minutes from where I linked, it's really interesting.

  3. 1. Low precision, high recall.

    2. Precision is more important in this case.

    A. Letters could return a lot of false-positives. It would have low precision and high recall.
    B. A fair balance between recall and precision.
    C. High precision, low recall.

    4. It would return a value of 1 when comparing the different sentences.

    5. Wouldn't it say the similarity is ½?

    Oh, and I found this since it came up in class:

    That guy came off as a complete asshole.

  4. On the Woody Allen movies issue, consider the fact that once the movie record is given back to you, you can probably tell whether the movie is directed by Allen or not (assuming, as we do, that IMDB doesn't make up movie records). So in a way, the bigger problem is not being returned some of the Allen movies.

    Ivan--please moderate your language since it is a class blog and others might have a more sensitive disposition than you.

  5. 1. 0 precision very high recall

    2. precision would be more important than recall but recall would still be important

    3 A. High recall low precision
    B. Balance of recall and precision
    C. High precision low recall

    4. very low precision because all of the sentences would have the same bag of words... this would mean that search engines are really simple

    5. i think the similarity would be 1/2 since the comparison between the bag of words would be 1 word from the phrase to 2 words from the document

  6. 1. Precision is very low here, and recall is very high.

    2. I'm not entirely sure, because it gives you a site that directs you to your search, but it's not very accurate for exactly what you are searching, so I'd say low precision high recall.

    If documents were bags of letters, the number would be a very low ratio, because documents tend to be much larger than what you are searching, so it would be around 10/8923109843297409238750984 (exaggeration)
    If documents were bags of words, the ratio would go up slightly, but it would still be around 1/92384.
    If documents were bags of sentences on the other hand, the ratio would be very small, the chances of a document containing your exact sentence are quite low. it would be around 1/10 or maybe even lower.
    (made up numbers, not sure if they are even correct assumptions)

    4. Complete honesty here... What? i think i might have to have that explained in person to understand what you are trying to explain.

    5. The ratio would be 1/2, and the cos would produce 45 degrees, because.... idk... that is what i want the answer to be.

  7. 1. low precision and high recall
    2. precision would be more important
    3. letters would have way too much recall and low precision
    words would be somewhere in the middle with decent precision
    sentences would have very low recall but be very precise.
    4. it would have very low precision because it would return all of the messages.
    5. i think that the bag similarity would show 1 while the vector would be 2

  8. 1. Your friend would get low precision and high recall.

    2. In this case, the precision would be more important and would need to be high.

    3a.) documents as bags of letters (alphabetic characters)- would have a low precision and high recall
    3b.) documents as bags of words - the precision and recall would be in the mid-range
    3c.) documents as bags of sentences - would be a high precision but have a low recall

    4. In this case, the precision would be low and the recall would be high. The intelligence of the search engine is not too bright.

    5. I was a little confused on this one and not sure if it is correct. But I did the intersection, which is the minimum of words of both bags, and then the union, which is all the words added together. So by my calculations:

    cos ((bag1*bag2)/(bagsize1*bagsize2))
    cos ((1*2)/(5*10))
    cos (1/25) = .999, or rounded to 1

  9. 1. Precision would be "omg"ly low and recall would be pretty high.
    2. Wouldn't this depend on what the user is specifically trying to find? I think that, without those given details, they would both be pretty useful.
    3. a) Very low precision
    b) This is a good range, which is why we do it mostly in this form.
    c) a little too precise...without many options to choose from. so precision would be high but low recall.
    4. It would make me think that they're actually not that great. If they randomly put stuff together and give results....then it sounds like it can't be trusted all too much.
    5. I have absolutely no idea what's going on right here...

  10. 1. Low precision and high recall.
    2. I think it will be precision.
    3. a) high recall & low precision
    b) neither high nor low recall & precision
    c) high precision & low recall
    4. a) 1
    b) Search engines will not be that smart if they use that same similarity metric because they will not care about the order of the letters giving a lot of different combinations.
    5. a) I think that the similarity metric will be 1 … saying that they are the same.
    b) Cosine (0) =1 ?

  11. 1. low precision high recall.

    2. could be precision or recall, if it's just returning IMDB records only.

    3. letters would have way too high a recall and too low a precision. Depending on your search, words and sentences would have the best chance at precision; depending on what your search is, there may not be a good chance of your exact sentence existing, which is high precision and low recall.

  12. 1. low precision, practically no precision.
    high recall
    2. in this case, you'd rather be precise than have more results.
    3.a. high recall, low precision
    b. average recall and precision
    c. low recall, high precision
    4. search engines are kind of dumb in this regard. they would have low precision as it would return similar results in all attempts. I think.
    5. I'm not really sure what to do here

  13. 1. low Precision, high recall
    2. Precision, because then it would only return the movies Woody Allen directed, instead of movies that were directed by someone else or movies that had him in them.
    3. High recall and low precision
    High Precision and low recall
    4. It makes me think that all they do is compare words, thus are not that complex of programs
    5. no idea

  14. 1. low precision, high recall
    2. Precision is more necessary
    3. high recall, low precision
    -average recall and precision
    low recall, high precision
    4. Search engines wouldn't be as great as they are now because they don't care about order and just the presence of those certain letters
    5. I'm not sure how cosine-theta applies here

