LDA
It's been a long time since I've written something that's not code. While I write this, a white noise playlist from some random YouTube channel plays, and I have also been discovering a lot of new music lately, thanks to Spotify.
It's the 2nd of June 2024, a Sunday, and I am currently sitting in my cubicle on the first floor of the Faculty building at IIM Nagpur, and the heat outside is enough to toast bread. It's been half a month since I came here and the people are absolutely amazing, so sweet and helpful, my professor is even more amazing, and I absolutely love working here.
I've been realizing a lot of things, about my work, about myself. And now the song has changed to Gilded Lily.
Just fifteen days, and I've learnt more than I ever thought possible in that time frame.
I came across something amazing called Latent Dirichlet Allocation (LDA), which is a method of topic modelling.
Let me give
you a guided tour of what LDA is –
Let's say
you have a collection of articles from a magazine. You want to classify these
articles into different topics, but you don't know what the topics are yet. You
can either decide on a fixed number of topics and group similar articles
together, or you can set a similarity threshold to create topics dynamically.
To figure
out which topic an article belongs to, you look at the words used in the
article. Some words are good indicators of specific topics. For example, words
like "bat," "glove," "base," "homerun,"
and "stands" might make you think of baseball. However, some words
could belong to multiple topics, like "glove," which could also relate to fashion.
Words in an article hint at the topic, which stays hidden, or "latent." We assume that an article usually covers one main topic, meaning most words in it are about that topic. This assumption is captured by a concept called the Dirichlet distribution.
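To see what that assumption looks like in practice, here's a tiny sketch of my own (not from any particular LDA library), just using numpy: a Dirichlet with a small concentration parameter tends to produce topic mixtures where nearly all the weight sits on one or two topics, which is exactly the "an article is mostly about one topic" idea.

```python
# My own illustration: small alpha -> "peaky"/sparse topic mixtures,
# large alpha -> roughly even mixtures over all topics.
import numpy as np

rng = np.random.default_rng(42)
num_topics = 5

sparse_mix = rng.dirichlet(alpha=[0.1] * num_topics)    # small alpha
uniform_mix = rng.dirichlet(alpha=[10.0] * num_topics)  # large alpha

print("alpha = 0.1 :", np.round(sparse_mix, 3))   # one topic tends to dominate
print("alpha = 10  :", np.round(uniform_mix, 3))  # topics get similar weights
```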
To figure
out which words belong to which topics, we start with a guess. We assume words
in an article are mainly about one topic and that different words from the same
topic appear in the same document. We randomly assign words to topics and then
check if our assumptions hold. We look at the distribution of words to see if
it matches a Dirichlet distribution. If not, we reassign words to different
topics.
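If you want a feel for that "assign, check, reassign" loop, here's a rough toy sketch I put together in the spirit of collapsed Gibbs sampling for LDA. The function name and parameters are my own inventions, and libraries like gensim actually use a different inference method (variational Bayes) under the hood, so treat this purely as an illustration, not the real implementation.

```python
# Toy sketch (my own, hypothetical) of repeatedly reassigning each word to a
# topic, based on how much its document likes that topic and how much that
# topic likes that word.
import numpy as np

def toy_lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50):
    rng = np.random.default_rng(0)
    # count tables: doc-topic counts, topic-word counts, topic totals
    n_dk = np.zeros((len(docs), num_topics))
    n_kw = np.zeros((num_topics, vocab_size))
    n_k = np.zeros(num_topics)

    # start with a random guess: every word gets a random topic
    z = [[rng.integers(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # take this word out of the counts...
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # ...score every topic: "does this doc like topic k?" times
                # "does topic k like this word?"
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(num_topics, p=p / p.sum())
                # ...and put it back under the (possibly new) topic
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    return n_dk, n_kw  # doc-topic and topic-word count tables
```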
We keep
adjusting the word assignments until there are very few changes left. At this
point, we stop and have a set of words grouped into topics. Now, we can look at
an article and see which topic's words are present in it. If the words from a
topic are in the article, we say the article is about that topic.
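In practice you don't write any of that by hand. Here's roughly what training a model and checking a new article looks like with gensim; the tiny corpus and the numbers are made up by me just to show the shape of the API.

```python
# Rough sketch of the gensim workflow with a made-up toy corpus.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    ["bat", "glove", "base", "homerun", "stands"],    # baseball-ish
    ["glove", "silk", "dress", "runway", "fashion"],  # fashion-ish
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=0)

# which words each discovered topic is made of
print(lda.print_topics())

# which topic a new article leans towards
new_doc = dictionary.doc2bow(["glove", "base", "stands"])
print(lda.get_document_topics(new_doc))
```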
How beautiful is this, and the fact that there are so many more beautiful iterations of this in existence, and so many more different things, is just so cool. It makes me realize we live in an infinitely interesting world.
It's been like 20 minutes since my code has been running. I didn't know LDA/gensim were such heavy libraries, or is it that since I am running them over 100 documents, it's taking so much time?
I think this
is it for today.