LDA

 

It's been a long time since I've written something that isn't code. While I write this out, a white noise playlist from some random YouTube channel is playing, and I've also been discovering a lot of new music lately, thanks to Spotify.

It's the 2nd of June 2024, a Sunday, and I am currently sitting in my cubicle on the first floor of the Faculty building at IIM Nagpur, and the heat outside is enough to toast bread. It's been half a month since I came here and the people are absolutely amazing, too sweet and helpful; my professor is even more amazing, and I absolutely love working here.
I'm realizing a lot of things: about my work, about myself, about a lot of things. And now the song has changed to Gilded Lily.

 

Just fifteen days, and I've learnt more than I thought possible in that time frame.
I came across something amazing called Latent Dirichlet Allocation (LDA), which is a method of topic modelling.

Let me give you a guided tour of what LDA is –

Let's say you have a collection of articles from a magazine. You want to classify these articles into different topics, but you don't know what the topics are yet. You can either decide on a fixed number of topics and group similar articles together, or you can set a similarity threshold to create topics dynamically.

 

To figure out which topic an article belongs to, you look at the words used in the article. Some words are good indicators of specific topics. For example, words like "bat," "glove," "base," "homerun," and "stands" might make you think of baseball. However, some words could belong to multiple topics, like "glove" which could also relate to fashion.

 

Words in an article hint at this hidden topic information, which is why it's called "latent." We assume that an article usually covers one main topic, meaning most words in it are about that topic. This assumption is captured by a distribution called the Dirichlet distribution, which (with a small enough concentration parameter) favours topic mixes dominated by a single topic.
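To make the Dirichlet part a bit more concrete, here's a tiny sketch (the number of topics and the alpha values are just numbers I picked for illustration) of how a small concentration parameter makes a document's topic mix lean heavily on one topic, while a large one spreads it out evenly:

```python
import numpy as np

rng = np.random.default_rng(42)

# Small alpha (0.1): the sampled topic proportions come out "spiky",
# with most of the probability mass on one or two topics.
sparse_mix = rng.dirichlet(alpha=[0.1] * 5)

# Large alpha (10): the mass is spread fairly evenly across all five topics.
even_mix = rng.dirichlet(alpha=[10.0] * 5)

print("alpha = 0.1 :", np.round(sparse_mix, 3))
print("alpha = 10  :", np.round(even_mix, 3))
```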

 

To figure out which words belong to which topics, we start with a guess. We assume the words in an article are mainly about one topic and that words from the same topic tend to appear together in the same documents. We randomly assign words to topics and then check whether those assumptions hold, that is, whether each document's words really do concentrate on a few topics the way a Dirichlet distribution would suggest. Where they don't, we reassign words to different topics.
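This reassignment step roughly corresponds to what's called collapsed Gibbs sampling. Here's a small sketch of the scoring rule for a single word occurrence; the counts, alpha and beta below are made-up toy numbers, and this isn't how gensim does it internally (gensim uses a variational method), it's just the idea:

```python
import numpy as np

def reassignment_scores(n_wk, n_k, n_dk, vocab_size, alpha=0.1, beta=0.01):
    """Unnormalised probability of assigning one word occurrence to each topic.

    n_wk : how often this word is currently assigned to each topic (length-K array)
    n_k  : total number of words currently assigned to each topic (length-K array)
    n_dk : how many words in this document are assigned to each topic (length-K array)
    (The word occurrence being resampled is assumed to be left out of the counts.)
    """
    word_given_topic = (n_wk + beta) / (n_k + vocab_size * beta)
    topic_given_doc = n_dk + alpha
    return word_given_topic * topic_given_doc

# Toy numbers: 3 topics, a vocabulary of 1000 words.
scores = reassignment_scores(
    n_wk=np.array([12, 0, 1]),      # "glove" has mostly been assigned to topic 0
    n_k=np.array([400, 350, 250]),
    n_dk=np.array([30, 2, 1]),      # this document already leans towards topic 0
    vocab_size=1000,
)
probs = scores / scores.sum()
new_topic = np.random.default_rng(0).choice(len(probs), p=probs)
print(probs, "-> reassigned to topic", new_topic)
```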

 

We keep adjusting the word assignments until there are very few changes left. At this point, we stop and have a set of words grouped into topics. Now, we can look at an article and see which topic's words are present in it. If the words from a topic are in the article, we say the article is about that topic.
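In code the whole pipeline ends up surprisingly short. Here's roughly how it looks with gensim; the toy documents and parameter choices below are made up for illustration, only the API calls are the actual gensim ones:

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus; in my case this is ~100 real documents, tokenised the same way.
documents = [
    ["bat", "glove", "base", "homerun", "stands"],
    ["glove", "leather", "fashion", "runway"],
    ["pitcher", "bat", "inning", "stands"],
]

dictionary = corpora.Dictionary(documents)                # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]   # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# Which words make up each topic?
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [word for word, _ in words])

# Which topics is a given article about?
print(lda.get_document_topics(corpus[0]))
```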

How beautiful is this? And the fact that there are so many more beautiful variations of this in existence, and so many more different things, is just so cool. It makes me realize we live in an infinitely interesting world.

It's been like 20 minutes since my code started running. I didn't know that LDA/gensim were such heavy libraries, or is it taking so much time because I am running them over 100 documents?
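(If I had to guess, the number of passes over the corpus probably matters more than the 100 documents themselves. gensim also has an LdaMulticore variant that spreads training over CPU cores, though I haven't measured how much it actually helps on this corpus; the toy corpus below is just for shape.)

```python
from gensim import corpora
from gensim.models import LdaMulticore

docs = [["bat", "glove", "base"], ["glove", "fashion", "runway"]]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# Same interface as LdaModel, plus `workers` to use multiple CPU cores.
lda = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2,
                   passes=5, workers=2)
```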

 

I think this is it for today.
