[Code] Google Reader Topic Organizer
Jan. 22nd, 2012 08:01 pm
Code: http://pastebin.com/s5bkVEuV
For the longest time, I'd been hurting for a good, topic-based organization scheme for all of the feeds I consume in Google Reader. This may be useful to anyone who has the same problem with their own RSS reader, so I figured I'd publish this for the Greater Good™.
My problem is: I consume (not necessarily read) over 500 RSS-syndicated articles a day. Because of this, I need a good, fast indexing scheme that tells me exactly what the most common topics happen to be so I can organize my time effectively. I have less time for obscure things, though if I'm targeting those to give myself a rest from the tedium of daily news, I'd like to be able to see them at a glance, too.
Doing this on a source-by-source basis falls apart when common issues (like SOPA) transcend sources that might otherwise contain very focused content. So, I needed a good organizational scheme that was topic-oriented, source-agnostic, and binned all content by its most common keywords and key phrases. Further, it needed to target what the articles are actually saying, instead of what the articles think they're saying (via, say, tagging the article).
The easiest organizational scheme I could think of was to present feeds similar to new posts in an Internet forum. So, I did. Technical babbling follows on how the algorithm does that.
Because titles usually contain the most relevant content for a particular item, each article's title receives its own topic category automatically, according to very simple rules for word and phrase relevancy. "Relevancy" is determined by how often words and phrases appear in the title, minus how often each component word appears in the message bodies of all articles. In other words, it's very simple, quorum-based voting that favors well-organized articles: the ones that put their most relevant information in the title and save all of their flavor text for the body.
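For the curious, here is a minimal sketch of that scoring rule. It is not the code from the pastebin link, just an illustration under my own assumptions: articles arrive as (title, body) pairs, title counts are pooled across all titles, phrases are runs of up to three words, and ties go to the longer phrase. The names `tokenize`, `ngrams`, `score_phrases`, and `assign_topics` are made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(words, max_len=3):
    """Yield every run of 1..max_len consecutive words as a tuple."""
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def score_phrases(articles, max_len=3):
    """Score each title phrase: title frequency minus body frequency.

    Assumption: title counts are pooled over all titles, and the body
    penalty is the summed body count of each component word.
    """
    title_counts = Counter()
    body_counts = Counter()
    for title, body in articles:
        title_counts.update(ngrams(tokenize(title), max_len))
        body_counts.update(tokenize(body))
    return {
        phrase: count - sum(body_counts[word] for word in phrase)
        for phrase, count in title_counts.items()
    }


def assign_topics(articles, max_len=3):
    """Bin each article under its best-scoring title phrase."""
    scores = score_phrases(articles, max_len)
    topics = {}
    for title, _body in articles:
        candidates = set(ngrams(tokenize(title), max_len))
        if not candidates:
            continue  # empty or symbol-only title, nothing to bin on
        # Assumed tie-break: prefer the longer phrase, then alphabetical order.
        best = max(candidates, key=lambda p: (scores[p], len(p), p))
        topics.setdefault(" ".join(best), []).append(title)
    return topics
```

Feed `assign_topics` a list of (title, body) pairs and you get back a dict mapping each topic phrase to the titles filed under it, which is enough to render a forum-style listing. A word that shows up in lots of titles but rarely in bodies floats to the top; common filler words sink because the bodies vote them down.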
End technobabble.
What I've found is that this simple scheme works surprisingly well. It allows me to take my complete collection of feeds and articles and extract, at a glance, the most relevant ones that I should be reading. And I like that; it gives me good, high-level insight that I can use to cut out much of the noise in favor of juicy, juicy signal. And it's so useful that I'm left to wonder why more RSS readers don't do this effectively, in idiomatic, simple, and well-organized ways.
So, as a grassroots effort at improving everyone's online reading experience, I figured I'd just release the code. It's pretty technical, for those who don't like playing with Python code and bending it to their will.
But for everyone who does: would you share this, improve it, and get it submitted to applications that should be using exactly this sort of organizational scheme? I'd greatly appreciate it, if only for the joy of knowing I helped make this little place we call the Internet that much easier to work with.