
A Waage Blog

Ruby, Rails, Life

How I finally got Vowpal Wabbit 7.0 installed on OSX 10.6 Snow Leopard

with one comment

I’ve read all the other tutorials online, but none of them really worked. It was more trouble than I expected, but here’s what I had to do:

First step: make sure you have Homebrew installed. This is the package manager I used to install all the other prerequisites.
1. install boost

$ brew install boost
$ brew ln boost

2. install automake / autoconf

# May prompt you to overwrite
$ brew install automake
$ brew ln automake
$ brew install autoconf
$ brew ln autoconf

3. install libtool (Homebrew calls it glibtool)

$ brew install libtool
$ brew ln libtool

4. symlink glibtoolize as libtoolize

# Homebrew installs GNU libtool with a "g" prefix (glibtool / glibtoolize), so I just had to create a symlink
$ cd /usr/local/bin
$ ln -s glibtoolize libtoolize

Now we can finally run autogen.sh and the rest of the build steps as described in the README:

$ git clone git://github.com/JohnLangford/vowpal_wabbit.git
$ cd vowpal_wabbit
# Check out whatever branch you want
$ git checkout -b v7.0
$ ./autogen.sh
$ ./configure
$ make
$ make install

Written by Andrew Waage

November 8th, 2012 at 5:03 pm

R Dummy Coding for Categorical (Nominal) Data

without comments

When I’m pre-processing data as input for a classification or clustering algorithm, one of the most common things I need to do is convert a categorical attribute into a long, sparse binary vector. For example, if a variable is named “Color” and the values present in the data are “red”, “blue”, and “green”, here is an easy way to create the dummy attributes. It also builds readable column names for the new attributes, so you get 3 binary columns named “Color_red”, “Color_blue”, and “Color_green”.

# Include these two functions in your R script or helpers file, and call it like this:
mydataframe <- replace_col_with_dummy(mydataframe, 'Color')
# Create dummy (one-hot) coding for a categorical column
dummy_cat <- function(column_name, column){
  idx <- sort(unique(column))                        # one dummy column per distinct value
  dummy <- mat.or.vec(length(column), length(idx))   # zero matrix: rows x levels
  for (j in 1:length(idx)) {
    dummy[, j] <- as.integer(column == idx[j])       # 1 where the row has this value
  }
  # Column names like "Color_red"; spaces in values become underscores
  colnames(dummy) <- gsub("[ ]", "_", paste(column_name, idx, sep="_"))
  return(dummy)
}

# Replace the original column with its dummy-coded columns
replace_col_with_dummy <- function(dataframe, column_name){
  dataframe <- cbind(dummy_cat(column_name, dataframe[, column_name]),
                     dataframe[, !(names(dataframe) %in% c(column_name))])
  return(dataframe)
}

Written by Andrew Waage

October 25th, 2012 at 10:01 pm

Ruby on Rails Action Named “status” is Reserved

without comments

I just spent over an hour debugging a really frustrating problem. Apparently, defining a controller action named “status” is no good!

It will not break explicitly, but it causes all kinds of weird behavior, most likely because it shadows the controller’s built-in status method (the HTTP response status accessor). Please be advised!

Do NOT do this in a controller!

class MyController < ApplicationController
  ## DONT DO THIS!!!
  def status
  end
end
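
The fix is simply to name the action something else; if you want the public URL to keep saying /status, handle that in the routes. A rough sketch (the action and route names here are just examples, not from my app):

class MyController < ApplicationController
  # Any name other than "status" avoids the clash
  def health_status
    render :json => { :ok => true }
  end
end

# config/routes.rb
# get 'my/status' => 'my#health_status'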

Save yourself some headache :)

Written by Andrew Waage

October 3rd, 2012 at 12:01 am

Rails testing with Machinist 2, Rspec, Database Cleaner Gem

without comments

A QUICK vent and piece of advice when using Machinist 2 and the Database Cleaner gem to test in Rails:

TURN OFF MACHINIST CACHING!

Add this to your config/environments/test.rb file:

Machinist.configure do |config|
  config.cache_objects = false
end

Machinist tries to do some weird caching to make your tests run faster, but it doesn’t quite work the way you’d expect. If you are running into strange problems where your objects persist across many tests, even though you are using DatabaseCleaner after each test, you might try this. If running one test at a time works, but running “rake spec” results in errors, this is also worth a shot. Don’t let Machinist caching drive you nuts! :)
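
For context, the DatabaseCleaner hookup I’m referring to is the usual per-test setup, roughly like this (standard boilerplate, not my exact spec_helper):

# spec/spec_helper.rb
RSpec.configure do |config|
  config.before(:suite) do
    DatabaseCleaner.strategy = :truncation
  end

  config.before(:each) do
    DatabaseCleaner.start
  end

  config.after(:each) do
    DatabaseCleaner.clean   # wipe the DB after every example
  end
end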

Sidenote: In my experience, the best way to debug errors that appear when running the entire test suite, but not when running individual tests, is to use RSpec to run all but one spec file. Remove one file at a time and see whether dropping that single file makes the errors go away.
Example:

# If this gives errors:
$ bundle exec rspec ./spec/models/user_spec.rb ./spec/models/account_spec.rb ./spec/models/favorite_spec.rb
# Try removing the first
$ bundle exec rspec ./spec/models/account_spec.rb ./spec/models/favorite_spec.rb
# Try removing the 2nd
$ bundle exec rspec ./spec/models/user_spec.rb ./spec/models/favorite_spec.rb
# Repeat...

Written by Andrew Waage

April 11th, 2012 at 6:59 pm

Thinking Big (Data)

without comments

Recently I’ve been doing lots of “big data” sorts of things. As a result, I’ve been forced to look at solving programming problems from a slightly new perspective.

Doing things like data mining with huge dataset pre-processing and manipulation, matrix computation, and other machine learning tasks can be computationally expensive and take forever! This has forced me to pay close attention to things such as code efficiency (not using Ruby for certain things – gasp!), database tuning, refactoring queries, optimal database schemas (and database choices), caching at all levels (application, DB, etc.), parallel processing (along with all its problems: concurrency issues, deadlocks, etc.), system memory management, MapReduce, and on, and on..

One thing I especially love about working with massive data sets is that it really forces me to think “big”. No longer am I constrained to solving the problem with a single machine! In a couple of clicks, I can create additional servers or double or triple the size of my server, quickly do the computation or processing, and then bring the server back down to a more affordable size.

In this day and age, with cloud computing services like Amazon EC2 / Rackspace at your fingertips, you only pay for what you use. This is an important concept to consider, and definitely requires a different way of thinking about solving a “big” problem.

Of course, the first step should still be to optimize your code and use the right tools for the problem. But these are additional tools in your toolbox that really pay off when you need them.

Here’s an example I recently encountered.

Prototype:
1. I started with a small test dataset matrix (3 MB, 3,000 rows).
2. I wrote a Ruby/MySQL script that would prepare this matrix from our DB. It took about 5 minutes, and that was fine. 5 minutes I can spare.
3. Next, I wanted to try this on my second dataset matrix (20 MB, 20,000 rows). Hmm… it took over half an hour. That’s not THAT bad. But imagine running this on a 200 MB, 20 GB, or 2 TB dataset?! Gotta do better.

Step 1: Optimize code / queries
I spent a good amount of time tuning MySQL’s memory, caching, and thread settings, making smarter queries, and doing joins and de-normalizing tables to reduce the number of queries. Knowing that Ruby is not the fastest language out there :( I tried to be smarter about avoiding extra iterations, extra memory use, etc.
- Result: I was able to cut the time for the second dataset from 30 minutes to about 7 minutes! (not bad)

Step 2: Parallelize code
I knew that the code was basically iterating over a huge array, so I wanted to try parallelizing some of the actions it was performing. This was my first try at using Ruby’s Parallel gem, which lets you split up any code across parallel threads or processes. Pretty cool stuff. I experimented with a bunch of different numbers of processes and threads to find out what the best settings were. It turns out the optimal setting was 4 concurrent processes: my code was hitting the database (very costly I/O), so the database was actually the bottleneck, and using more than 4 concurrent processes resulted in too much DB traffic and hurt performance.
- Result: I was able to cut the time for the second dataset from 7 minutes to about 1.5 minutes! (Getting there!)
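
Roughly, the Parallel gem usage is this simple (the data loading and per-row helpers below are placeholders, not my actual script):

require 'parallel'

rows = load_rows          # placeholder for the big array being iterated over

# 4 processes was the sweet spot; more than that just fought over the database
results = Parallel.map(rows, :in_processes => 4) do |row|
  process_row(row)        # placeholder for the expensive per-row work
end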

Step 3: Try my sexy parallelized code on a bigger dataset (12 GB)
- Of course I hit all kinds of problems: memory limits, database collisions, figuring out the optimal number of parallel processes, etc. Not so fast!
- Result: I was forced to run the code in a single process, which took something like 5-6 hours (boo :( )

Step 4: Figure out how to run the parallelized code on the large dataset
This required examining the logs and figuring out why I was hitting bottlenecks. It turns out that when I ran many concurrent processes, they were all fighting for DB access; MySQL didn’t like this, and in the end all the processes stalled. I found that by putting an increasing delay in each subsequent child process, I could avoid the collisions. Since each process performed a huge DB query up front and then spent the rest of its life using memory and CPU (but not competing for the DB), they could all run together nicely.
- Result: I was able to cut the time for the huge dataset from 5-6 hours to about 1.5 hours! (yes!)
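
The staggering itself is trivial; a rough sketch of the idea (the chunking, the delay interval, and the helper names below are placeholders):

require 'parallel'

chunks = rows.each_slice((rows.size / 4.0).ceil).to_a   # split the work into 4 chunks

Parallel.each(chunks.each_with_index.to_a, :in_processes => 4) do |(chunk, i)|
  sleep(i * 30)                  # stagger the start so the big up-front queries
                                 # don't all hit MySQL at the same instant
  data = load_data_for(chunk)    # placeholder: the one heavy DB read per process
  chunk.each { |row| crunch(row, data) }   # placeholder: CPU/memory-bound work
end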

Step 5: Not completely satisfied, I wanted to try enlarging my cloud instance and running the exact same command. I knew that on my 4 GB server, the CPU and memory were maxed out by my processing script, so I tried increasing it to 16 GB. The results were not so stellar.
- Result: I was able to cut the time for the huge dataset from 1.5 hours on the 4 GB machine to about 1 hour on the 16 GB machine. This was with no additional DB tuning (to make use of the extra system resources) and no adjustment to the number of processes (a more powerful server could have handled more concurrent processes).

What I should have done next: Tune the db settings, tune the number of processes, and try again.

What I actually did instead: Called it a day, and wrote this article :)

My personal takeaways:
- Big data is exciting
- Use the right tools for the right job
- Optimize my code
- Tune my DB settings
- Use the cloud (Thanks Rackspace!)
- Think big!

Written by Andrew Waage

March 8th, 2012 at 10:49 pm