Chandler Zuo

How to Translate SQL Queries to Simple R Codes

08 Oct 2014 | Chandler

It is sometimes desirable to process queries for table manipulation in R. Currently, there are a couple of packages that give database interfaces that allow you to directly send SQL queries to external databases. However, for simple tasks we may not want such a complicated infrastructure. So, are there some simple alternatives?

The answer is yes. Although R can not understand SQL queries directly, we can often translate the query into several table transformation steps. And believe it or not, with some little tricks, R codes can actually perform very complicated tasks!

There are two tricks that I find especially useful in translating SQL queries:

HashKey trick: name your rows by some distinct character strings. Notice that if you name the rows in an R table, then each row can be directly accessed by its name. This is especially useful when you wish to generate a composite key using several columns of the table. I name this trick the “HashKey trick” because these strings essentailly act as hash keys.
data.table: use the {data.table} package. This is a powerful package that can perform table joins and aggregations in pretty neat codes.

To illustrate their usage, I will use three examples modified from Ramakrishnan & Gehrke’s DBMS textbook.For each example, I will write the SQL query and the R codes using the two tricks that I mentioned above. The examples use the following schema:

Sailors(tid: integer, sid: integer, sname: character, rating: integer, age: numeric)
Boats(bid: integer, bname: character, color: string)
Reserves(tid: integer, sid: integer, bid: integer, day: integer)

By such notations, I mean that table “Sailors” is a data.frame containing three columns, “sid” is an integer column, “sname” is a character column, and “age” is a numeric column, and etc. Now I will present you some examples.

The “Sailors” table has distinct combinations of “tid” and “sid” columns. The “Boats” table has distinct values in the “bid” column.

PREPARATION

You can generate the synthetic data set in R:

sailors <- data.frame(tid = rep(1:2, each = 5), 
		sid = rep(seq(5), 2),
		sname = c("a", "b", "c", "d", "e", "f", "g", "a", "b", "c"), 
		age = sample(20:25, 10, replace = TRUE))
boats <- data.frame(bid = 100 + 1:5, 
		bname = c("x", "x", "y", "z", "w"), 
		color = rep(c("red", "blue"), c(2, 3)))
reserves <- data.frame(tid = sample(1:2, 10, replace = TRUE), 
		sid = sample(1:5, 10, replace = TRUE), 
		bid = sample(100 + 1:5, 10, replace = TRUE), 
		day = sample(1:365, 10, replace = TRUE))
sailors$sname <- as.character(sailors$sname)
boats$sname <- as.character(boats$bname)
boats$color <- as.character(boats$color)

We need to convert the tables into the data.table class:

library(data.table)
Sailors <- data.table(sailors)
Reserves <- data.table(reserves)
Boats <- data.table(boats)

(Q1) Find the names of sailors who have reserved Boat 103.##

SQL query:

select Sailors.sname
from Sailors, Reserves
where Sailors.sid = Reserves.sid and 
	Sailors.tid = Reserves.tid and
	Reserves.bid = 103;

Hashkey trick:

We use the combination of “tid” and “sid” columns as the hash key for “Sailors” table. The R code is the following:

rownames(sailors) <- paste(sailors$tid, sailors$sid, sep = "_")
reserves <- within(reserves, key <- paste(tid, sid, sep = "_"))
key_target <- reserves$key[reserves$bid == 103]
sailors[key_target, ]$sname

The trick here is that, once you name the rows of a table, you can directly access that row using its row name. This is what happened in the fourth line.

data.table trick:

Then, we need to set up keys for table join. The keys are the columns that are used to match the two tables:

setkey(Sailors, tid, sid)
setkey(Reserves, tid, sid)
Sailors[Reserves][bid == 103, sname]

Notice the third line: Sailors[Reserves] joins the tables by the sid column, and [bid == 103, sname] performs the query and output the “sname” column. This is very neat!

(Q2) Find the colors of boats reserved by “a”.##

SQL query:

select Boats.color
from Sailors, Reserves, Boats
where Sailors.sname = "a" and Sailors.sid = Reserves.sid and 
Sailors.tid = Reserves.tid and Boats.bid = Reserves.sid;

Hashkey trick:

key_target <-rownames(sailors[sailors$sname == "a", ])
reserves_tmp <- reserves[reserves$key %in% key_target, ]
rownames(boats) <- boats$bid

boats[as.character(reserves_tmp$bid), ]$color

data.table trick:

setkey(Boats, bid)
Sailors_Reserves <- Sailors[Reserves][sname == "a", ]
setkey(Sailors_Reserves, bid)
Boats[Sailors_Reserves, color]

##(Q3) Find the bids of the boats for each sailor that he/she has not reserved.##

SQL query:

select bid, sname
from Boats, Sailors
where not exists
    (select *
    from Reserves
    where Reserves.sid = Sailors.sid and Reserves.tid = Sailors.tid and Reserves.bid = Boats.bid);

Hashkey trick:

Here is a little brute force: we calculate the cartesian product between Sailors and Boats, and see which combination is not contained in Reserves. Notice that we do not need to actually generate the cartesian table. We only need to compute the “hash keys” of this conceptual table. However, a little string manipulation is needed to get the hash keys for our results (“key_boats” and “key_sailors” in the following).

key_cartesian <- paste(rep(rownames(sailors), each = nrow(boats)), 
	rep(boats$bid, nrow(sailors)), sep = "_")
key_reserves <- paste(reserves$tid, reserves$sid, reserves$bid, sep = "_")
key_target <- key_cartesian[! key_cartesian %in% key_reserves]
key_split <- strsplit(key_target, "_")
key_boats <- sapply(key_split, function(x)x[3])
key_sailors <- sapply(key_split, function(x) paste(x[1], x[2], sep = "_"))
cbind(sailors[key_sailors, ]$sname, key_boats)

data.table:

The idea is the same as using the hashkey trick. We need to compute the cartesian product using a dummy variable:

Sailors[, k:=1]
Boats[, k:=1]
cartesian_table <- merge(Sailors, Boats, by="k", allow.cartesian=TRUE)
setkey(cartesian_table, tid, sid, bid)
setkey(Reserves, tid, sid, bid)
cartesian_table[!Reserves][, list(bid, sname.x)]

This is a devision query, a quite complicated one that you may even not encounter often in practice. And see how neat are our R codes!