
A Waage Blog

Ruby, Rails, Life


R Dummy Coding for Categorical (Nominal) Data


When I’m pre-processing data as input for a classification or clustering algorithm, one of the most common things I need to do is convert a categorical attribute into a long, sparse binary vector. For example, if a variable is named “Color” and the values present in the data are “red”, “blue” and “green”, here is an easy way to create the dummy columns. It also generates readable column names for the new attributes, so you get 3 binary columns named “Color_blue”, “Color_green” and “Color_red” (values are sorted, and any spaces in them become underscores).

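# Include these two functions in your R script or helpers file, then call
# replace_col_with_dummy() like this:
#   mydataframe <- replace_col_with_dummy(mydataframe, 'Color')

# Create dummy (one-hot) coding for a categorical column.
# Returns an integer matrix with one 0/1 column per distinct value.
dummy_cat <- function(column_name, column) {
  idx <- sort(unique(column))  # distinct values; sort() drops NA
  # matrix() always returns a 2-D matrix, even with a single distinct value
  dummy <- matrix(0L, nrow = length(column), ncol = length(idx))
  for (j in seq_along(idx)) {
    dummy[, j] <- as.integer(column == idx[j])
  }
  colnames(dummy) <- gsub("[ ]", "_", paste(column_name, idx, sep = "_"))
  return(dummy)
}

# Replace the named column with its dummy columns, which are placed first.
replace_col_with_dummy <- function(dataframe, column_name) {
  # drop = FALSE keeps the remainder a data frame even if one column is left
  rest <- dataframe[, !(names(dataframe) %in% column_name), drop = FALSE]
  dataframe <- cbind(dummy_cat(column_name, dataframe[, column_name]), rest)
  return(dataframe)
}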
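
For comparison, base R's model.matrix() can build the same kind of indicator columns. The snippet below is only a minimal sketch on a made-up data frame: the names df, Size, mm and encoded are assumptions for illustration, not part of the functions above. Dropping the intercept with "- 1" gives one column per level, and the sub() call adds the underscore so the names match the "Color_red" style. Note that model.matrix() returns numeric 0/1 values rather than integers and, by default, drops rows where the variable is NA.

```r
# Hypothetical example data: one categorical column and one numeric column.
df <- data.frame(
  Color = factor(c("red", "blue", "green", "red")),
  Size  = c(1, 2, 3, 4)
)

# "~ Color - 1" removes the intercept, so every level gets its own 0/1 column.
mm <- model.matrix(~ Color - 1, data = df)

# model.matrix() names columns "Colorblue", "Colorgreen", ...; insert an underscore.
colnames(mm) <- sub("^Color", "Color_", colnames(mm))

# Put the indicator columns first, followed by the remaining columns.
encoded <- cbind(as.data.frame(mm), df[, setdiff(names(df), "Color"), drop = FALSE])

print(names(encoded))
# "Color_blue" "Color_green" "Color_red" "Size"
```

I still like the two small helpers above when I want integer columns and spaces in values turned into underscores, but model.matrix() is handy when the data is already stored as factors.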

Written by Andrew Waage

October 25th, 2012 at 10:01 pm