Hao Chen | 陈浩     Posts     About Me

Extremely Fast Row Match between Two Matrixes using data.table

Chen Hao posted on 16 Mar 2016

Even though I have heared about data.table many times, but this is the first time I tasted the sweetness of it. It’s really fast.

My problem is to compare rows between two matrix and get the match ID. More exactly, I want a function to match matrix x to matrix y by rows, then return the match ID of y.

My first idea was to claculate the distance between rows of matrix x and matrix y. To solve the memory limit for big distance matrix, I cut matrix x into chunks. Without data.table, I wrote a funciton like this below:


rowMatch <- function(x,y){
    mid <- NULL
    colm <- match(colnames(x), colnames(y))
    if(any(is.na(colm)) || ncol(x) != ncol(y)){
        stop("colnames doesn't match!")
    }
    y <- y[ ,colm]
    
    splitFactor <- splitFactorGenerator(nrow(x), nrow(y))
    dataFolds <- split.data.frame(as.data.frame(x), splitFactor)
    
    for(k in 1:length(dataFolds)){
        cat("Fold ", k , " in ", length(dataFolds), ", ")
        xk <- as.matrix(dataFolds[[k]])
        cat(nrow(xk), "cells  \n")
        xydist <- as.matrix(pdist::pdist(xk, y))
        cat('Distance done\n')
        for(i in 1:nrow(xk)){
            xim <- which(xydist[i,] == 0)
            if(length(xim) == 0){
                message("Row ", i, " has no match in y!")
            }
            mid <- c(mid, xim)
        }
        cat('Match done\n\n')
    }
    
    mid <- unique(mid)
    if(length(mid) == nrow(x)){
        cat("x completely match in y.\n")
    }else{
        cat("Match percentage: ", round(length(mid)/nrow(x),2) * 100, "%.\n")
    }
    
    return(mid)
} 

## generate split factors to split rowNum into folds, each of size around foldSize
splitFactorGenerator <- function(rowNum, colNum){
    if(missing(colNum)){
        colNum <- rowNum
    }
    foldSize <- round(62772663 / colNum)  ## each chunk with maxi 500Mb
    foldNum <- ceiling(rowNum / foldSize)
    lastfoldSize <- rowNum - (foldNum-1) * foldSize
    if(foldNum > 1){
        splitFactor <- c(rep(1:(foldNum-1), each = foldSize), rep(foldNum, lastfoldSize))
    }else{
        splitFactor <- rep(foldNum, lastfoldSize)
    }
    return(splitFactor)
}

But this is fairly slow, especially when the matrix is big. So I started to search for some fast methods. I found this similar question on stackoverflow which shows the usage of data.table. But that doesn’t return the original ID of y, so I modified a little bit to get my second version of rowMatch as below:

require(data.table)
rowMatch2 <- function(x,y){

  colm <- match(colnames(x), colnames(y))
    if(any(is.na(colm)) || ncol(x) != ncol(y)){
        stop("colnames doesn't match!")
    }
  
  keycols <- colnames(x)[colm]
  x <- cbind(x, id=1:nrow(x))
  y <- cbind(y, id=1:nrow(y))
  m1 = data.table(x)
  setkeyv(m1, keycols)
  m2 = data.table(y)
  setkeyv(m2, keycols)
  m1id <- m1$id
  m2id <- m2$id
  
  m1$id <- NULL
  m2$id <- NULL
  
  m <- na.omit(m2[m1,which=TRUE])
  mo <- m2id[m][order(m1id)]
  
  if(length(mo) == nrow(x)){
    cat("Complete match!\n")
  }else{
    cat("Uncomplete match, match percentage is:", round(length(mo)/nrow(x), 4)*100, "%\n")
  }
  return(as.integer(mo))
}

It’s incrediablely fast, I will figure it out and really need to equip with this hack tool for data manipulation.

comments powered by Disqus