Over the years I’ve participated in a handful of projects where lexical analysis was an absolute necessity. One of the interesting ideas I’ve been playing around with in my spare time is what I call polarity analysis, more commonly known as sentiment analysis. Companies and academic groups alike have taken a stab at the idea, with varying degrees of success. Before diving in, let me give an example.
Say I am a book publisher and I am particularly interested in finding out what the general sentiment of the written reviews is for a given book. I’d like to have some rough idea of whether public opinion is negative or positive. Sure, I could simply look at the number of stars for each book if the site offered ratings like this (as Amazon does), but let’s say I wanted to dig deeper, or perhaps the site offered no star system.
Wouldn’t it be cool if you could automate that process and apply some sort of AI layer over the top to crunch the human sentences and give you a fairly accurate summary? I sure think so. In fact, I’m willing to bet that this will become more and more in demand as technology progresses. The question is, will Skynet happen first? (obligatory Terminator 2 reference)
Keep in mind, the book rating example is just one application.
Imagine applying the same concept to trading, whether it be stocks, options, commodities, or perhaps forex. Rather than relying on any given individual’s opinion, you could perhaps do a restricted Google search for all new posts over the last 24 hours that mention a given stock symbol or currency pair. With the results in hand, you would then programmatically walk through each one and apply very similar ideas to gauge whether the sentiment was bullish or bearish, helping you decide whether to go long or short.
Now that you have some feel for the applications, on with some basic but helpful code for such an endeavor. A useful first step is being able to quickly determine word frequency for a given body of text. Imagine having a database of negative and positive words just as a starting place. Sure, it wouldn’t be perfect, since you’re looking at individual words instead of groups of words and completely ignoring sarcasm, but let’s set all that aside for the sake of simplicity.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CustomExtensions
{
    /// <summary>
    /// Used to analyze word frequency for a given string.
    /// </summary>
    public static Dictionary<string, int> GetWordPopularity(this string input)
    {
        //Splitting the input on spaces, discarding empty entries, and keeping only tokens with
        //at least one word character so purely grammatical tokens such as "!!" are not passed
        //down the chain. The last portion of the chain throws away any non-alphanumeric chars
        //at the end of the word, converts it to lowercase, and drops anything left empty
        var words = input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(i => Regex.IsMatch(i, @"\w"))
            .Select(i => Regex.Replace(i, @"[^A-Za-z0-9]+$", "").ToLowerInvariant())
            .Where(w => w.Length > 0);

        //Grouping identical words so each one is counted once, then sorting by descending count.
        //Note: Dictionary does not formally guarantee enumeration order, so the sorted order
        //seen when looping over the result relies on its insertion-order behavior in practice
        return words
            .GroupBy(w => w)
            .OrderByDescending(g => g.Count())
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
I commented the above extension method fairly thoroughly, so there shouldn’t be much left to explain. An example follows:
class Program
{
    static void Main(string[] args)
    {
        string sampleSentences = @"Using LINQ with the .NET 4 framework can be a blast... The amount of flexibility it offers makes the life of a programmer 10 times easier than it use to be, not to mention the shorter and more concise code with lambda expressions!";
        foreach (var wordKVP in sampleSentences.GetWordPopularity())
            Console.WriteLine("{0,20}{1,5}", wordKVP.Key, wordKVP.Value);
    }
}
Extension methods are pretty fun to write, and the above call to GetWordPopularity, which implicitly passes the value of sampleSentences as its first argument, is a great example of their usage. The output is as follows:
the 4
with 2
be 2
a 2
of 2
it 2
to 2
using 1
linq 1
.net 1
4 1
framework 1
can 1
blast 1
amount 1
flexibility 1
offers 1
makes 1
life 1
programmer 1
10 1
times 1
easier 1
than 1
use 1
not 1
mention 1
shorter 1
and 1
more 1
concise 1
code 1
lambda 1
expressions 1
Not surprisingly, the most common words at the very top are fairly sentiment-agnostic. However, if you applied this most basic analysis to a larger body of text, you’d start to see clusters of frequencies. There are endless possibilities for approaching such a fuzzy problem, and the above is just a grain of sand on a very large beach, but hopefully it’s managed to spark some interest in the concepts. Cheers!
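P.S. For the curious, the database of negative and positive words mentioned earlier could be layered on top of the word counts roughly like this. Treat it as a minimal sketch rather than a finished approach: the SentimentSketch class, the ScorePolarity method, and the two tiny word lists are all made up for illustration, and a real lexicon would be far larger and probably weighted.

```csharp
using System;
using System.Collections.Generic;

public static class SentimentSketch
{
    // Hypothetical, hand-picked word lists purely for illustration;
    // a real lexicon would be loaded from a file or database.
    private static readonly HashSet<string> PositiveWords =
        new HashSet<string> { "great", "love", "excellent", "fun", "wonderful" };

    private static readonly HashSet<string> NegativeWords =
        new HashSet<string> { "boring", "hate", "terrible", "awful", "dull" };

    /// <summary>
    /// Adds the counts of positive words and subtracts the counts of negative words.
    /// A result above zero leans positive, below zero leans negative.
    /// </summary>
    public static int ScorePolarity(IDictionary<string, int> wordCounts)
    {
        int score = 0;
        foreach (var pair in wordCounts)
        {
            if (PositiveWords.Contains(pair.Key))
                score += pair.Value;
            else if (NegativeWords.Contains(pair.Key))
                score -= pair.Value;
            // Words in neither list (such as "the") do not affect the score
        }
        return score;
    }
}

public static class SentimentDemo
{
    public static void Main()
    {
        // Word counts in the same shape that GetWordPopularity returns
        var counts = new Dictionary<string, int>
        {
            { "the", 4 }, { "great", 2 }, { "boring", 1 }, { "plot", 1 }
        };

        // Two positive hits minus one negative hit
        Console.WriteLine(SentimentSketch.ScorePolarity(counts)); // prints 1
    }
}
```

Because common filler words appear in neither list, they simply drop out of the score, which sidesteps the problem of sentiment-agnostic words dominating the frequency table. It still treats every word in isolation, so negation ("not great") and sarcasm would fool it.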

