How silent coercion steepens R’s learning curve

People often ask me about the relative merits of R versus other statistical and mathematical programming environments (Matlab, SPSS, and so forth).  While I love R and use it almost every day, I always insist that it has an (unnecessarily) steep learning curve and can bite you in ways that are hard to recognize.  I also say that R’s practice of silent coercion, together with its sometimes misleading textual presentation of objects, contributes to that steep learning curve. Today I stumbled across a very concrete example which has bitten me for the umpteenth time.  The example involves the function tapply() and conversion to factors. Suppose that we have a number of observations, each of which belongs to one of K groups:

> K <- 6
> dat <- data.frame(group=rep(1:K, each=4))

There are underlying group-mean differences, and noise at the level of the observations as well:

> group.means <- 100 + runif(K, min=-20, max=20)
> dat$y <- group.means[dat$group] + rnorm(dim(dat)[1], 0, 40)

A very common thing to want to do is to compute some aggregate statistic of each group, and then record the group-specific aggregate statistic on each individual observation. This is naturally done with tapply():

> observed.group.means <- with(dat,tapply(y,group,mean)) ## compute the observed mean for each group
> dat$group.mean <- observed.group.means[dat$group] ## side note: if you have more than one grouping factor, you need to use cbind() inside the brackets
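
To unpack the side note: with more than one grouping factor, tapply() returns a matrix (or higher-dimensional array) of cell statistics, and you extract one cell per observation by indexing with a two-column matrix built by cbind(). A minimal sketch with toy data and variable names of my own (not from the example above):

```r
## Two grouping factors: tapply() with a list of groups returns a
## 2 x 2 matrix of cell means, with dimnames taken from the levels.
y  <- c(1, 2, 3, 4)
g1 <- c(1, 1, 2, 2)
g2 <- c(1, 2, 1, 2)
m  <- tapply(y, list(g1, g2), mean)

## A two-column index matrix picks out one cell per observation.
## (Numeric cbind() indexing is positional, so the pitfall described
## below applies here too if the group numbering doesn't start at 1.)
m[cbind(g1, g2)]   # c(1, 2, 3, 4): each observation's cell mean
```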

No trouble so far, and inspection of the new vectors shows everything’s ok:


> observed.group.means
1 2 3 4 5 6
75.32891 102.14953 89.51449 102.49337 90.25102 117.25358
> tail(dat)
group y group.mean
19 5 105.34542 90.25102
20 5 119.21220 90.25102
21 6 164.36802 117.25358
22 6 41.89392 117.25358
23 6 130.04878 117.25358
24 6 132.70359 117.25358

But now, let’s imagine that your group indices happen to start with some number other than 1 (e.g., you found that group 1 was tainted and threw it out before computing the aggregate statistic):

> dat1 <- subset(dat, group > 1)
> observed.group.means1 <- with(dat1,tapply(y,group,mean))
> dat1$group.mean <- observed.group.means1[dat1$group]

No errors, no warnings; everything seems ok. But unless you bothered to inspect the results, you might never guess that something has gone horribly wrong:

> observed.group.means1
2 3 4 5 6
102.14953 89.51449 102.49337 90.25102 117.25358
> tail(dat1)
group y group.mean
19 5 105.34542 117.2536
20 5 119.21220 117.2536
21 6 164.36802 NA
22 6 41.89392 NA
23 6 130.04878 NA
24 6 132.70359 NA

What happened? This unhappy situation results from two sins and one proper behavior:

  1. tapply() silently coerces its second argument to a factor. So the line

    > observed.group.means1 <- with(dat1,tapply(y,group,mean))

    treated the numeric vector dat1$group as a factor with 5 levels, named “2”, “3”, “4”, “5”, and “6”. observed.group.means1 is accordingly an array of length 5. [SIN]

  2. Indexing a vector or array with a numeric vector never coerces the index to a factor; the numbers are used as positions. So the operation

    > observed.group.means1[dat1$group]

    returns the element at each numeric position: the mean for group 3 when dat1$group is 2, the mean for group 4 when dat1$group is 3, and so forth. [PROPER BEHAVIOR]

  3. When you try to take elements of a vector using indices outside of the vector’s range, R silently gives you NAs back:

    > c(1,2)[3]
    [1] NA

    So the operation

    > observed.group.means1[dat1$group]

    silently gives NA for elements where dat1$group is 6, since observed.group.means1 is only of length 5. [SIN]
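
The three behaviors combine in a minimal toy sketch (the data and the names x, g, and m are hypothetical, not from the example above):

```r
## Groups start at 2, so tapply() returns one cell per factor level:
## a length-3 array named "2", "3", "4".
x <- c(10, 20, 30, 40)
g <- c(2, 3, 3, 4)
m <- tapply(x, g, mean)   # c(10, 25, 40), names "2" "3" "4"

## Numeric indexing is positional, so every lookup is shifted by one
## and the last group falls off the end of the array, yielding NA.
m[g]                      # c(25, 40, 40, NA), not the per-group means
```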

Now, there are good arguments for each of these behaviors: coercion to factor by tapply(), non-coercion for numeric indexing of vectors, and (most questionably, IMHO) returning NA for indexing outside a vector’s range. But the fact that R does each of these things silently makes it easy to lead even the expert user astray, and note that in this case the user might not discover any sign of an error until much further downstream (if at all).

From the user’s point of view, one way to avoid these dangers is to ensure that your data are of the correct type from the outset: if the differences among your groups are categorical rather than numeric, then you should be representing them with a factor in the first place, not a numeric vector. But it is incredibly easy to make this mistake, especially because so many other programs use integer values to encode categorical variables upon data export. R could make life easier for all its users by providing explicit warning messages in some of these cases, such as coercion and indexing outside the range of a vector, instead of staying silent.
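
In the meantime, there is a defensive fix within the current semantics: since tapply() names its result by factor level, indexing by character label rather than by numeric position looks each mean up by name. A self-contained sketch of the workflow above (the set.seed() call is my addition, just for reproducibility):

```r
## Rebuild the example, then broadcast the group means by *name*:
## character indexing matches the names of the tapply() result no
## matter where the group numbering starts.
set.seed(1)  # assumed seed, for reproducibility only
K <- 6
dat <- data.frame(group = rep(1:K, each = 4))
group.means <- 100 + runif(K, min = -20, max = 20)
dat$y <- group.means[dat$group] + rnorm(nrow(dat), 0, 40)

dat1 <- subset(dat, group > 1)
observed.group.means1 <- with(dat1, tapply(y, group, mean))

## The fix: index by name, not by position.
dat1$group.mean <- observed.group.means1[as.character(dat1$group)]
stopifnot(!anyNA(dat1$group.mean))  # no silent NAs this time
```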
