1. A search engine that my friend wrote tries to return *all* the documents in its database no matter what the query is. Comment on its
precision and recall (qualitatively--as in "high" and "low").
2. You are trying to find all the movies directed by Woody Allen, so you give the query "woody allen directed" to your search engine.
Suppose it returns IMDB records (IMDB is the Internet Movie Database). What is more important in this case--precision or recall?
3. I am considering viewing documents as bags and using the bag similarity measure to compute their similarity. I am considering three different bag representations:
a. documents as bags of letters (alphabetic characters)
b. documents as bags of words
c. documents as bags of sentences
Comment on the relative precision/recall values offered by the three approaches.
4. One of the CSE faculty members used to have a bunch of magnetic words stuck to his (then metallic) doors. Students passing by would try to arrange the magnetic words to make interesting English messages. If we assume that each student made it a point to use *all* the words in his/her message, then what would the bag-of-words similarity measure say about the pair-wise differences among those messages? Considering that I told you that search engines use these as the default similarity metrics, what does this make you think about the intelligence of search engines?
5. Suppose I have a document D1: "Rao is a happy chap". I make another document D2 by copying all the text from D1 and pasting it 2 times into D2 (so D2 is
"Rao is a happy chap Rao is a happy chap"). What will the bag similarity metric say about the similarity between D1 and D2? What will the cosine-theta (or vector) similarity metric
say about the similarity between D1 and D2? Which do you think is better for this case?
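
If you want to experiment with the measures named in these questions, here is a minimal Python sketch. It rests on assumptions that are not stated above: the bag similarity is taken to be the Jaccard-style ratio of bag intersection (minimum of the counts) to bag union (maximum of the counts), the cosine measure uses raw term counts with no tf-idf weighting, and words are obtained by lowercasing and splitting on whitespace. The function names (`bag_of_words`, `bag_similarity`, `cosine_similarity`, `precision_recall`) are made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a string into a bag (multiset) of lowercased, whitespace-separated words."""
    return Counter(text.lower().split())


def bag_similarity(a, b):
    """Assumed definition: |A intersect B| / |A union B| on bags.

    Counter's & keeps the minimum count per item and | keeps the maximum,
    which is one common way to define bag intersection and bag union.
    """
    union_size = sum((a | b).values())
    if union_size == 0:
        return 0.0
    return sum((a & b).values()) / union_size


def cosine_similarity(a, b):
    """Cosine of the angle between the raw count vectors of two bags."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant (sets of document ids)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The two documents from question 5.
d1 = bag_of_words("Rao is a happy chap")
d2 = bag_of_words("Rao is a happy chap Rao is a happy chap")
print("bag similarity:   ", bag_similarity(d1, d2))
print("cosine similarity:", cosine_similarity(d1, d2))
```

To try the other representations from question 3, swap `bag_of_words` for a function that builds a `Counter` over letters or over sentences; the two similarity functions work on any `Counter`.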