25.4 Missing values
If you’re working with missing values, you need to know two things:
- How R’s missing values behave in C++’s scalars (e.g., double).
- How to get and set missing values in vectors (e.g., NumericVector).
25.4.1 Scalars
The following code explores what happens when you take one of R’s missing values, coerce it into a scalar, and then coerce back to an R vector. Note that this kind of experimentation is a useful way to figure out what any operation does.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List scalar_missings() {
  int int_s = NA_INTEGER;
  String chr_s = NA_STRING;
  bool lgl_s = NA_LOGICAL;
  double num_s = NA_REAL;

  return List::create(int_s, chr_s, lgl_s, num_s);
}
str(scalar_missings())
#> List of 4
#> $ : int NA
#> $ : chr NA
#> $ : logi TRUE
#> $ : num NA
With the exception of bool, things look pretty good here: all of the missing values have been preserved. However, as we’ll see in the following sections, things are not quite as straightforward as they seem.
25.4.1.1 Integers
With integers, missing values are stored as the smallest integer. If you don’t do anything to them, they’ll be preserved. But, since C++ doesn’t know that the smallest integer has this special behaviour, if you do anything to it you’re likely to get an incorrect value: for example, evalCpp('NA_INTEGER + 1') gives -2147483647.
So if you want to work with missing values in integers, either use a length 1 IntegerVector or be very careful with your code.
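To make "be very careful" concrete, here is a minimal sketch in plain C++ (no Rcpp, so it compiles on its own). The name na_integer is a hypothetical stand-in for NA_INTEGER, and add_one_naive() and add_one() are made up for illustration. It assumes a 32-bit int.

```cpp
#include <limits>

// Hypothetical stand-in for NA_INTEGER: R stores it as the smallest int.
const int na_integer = std::numeric_limits<int>::min();

// Plain arithmetic: C++ treats the sentinel as an ordinary number.
int add_one_naive(int x) {
  return x + 1;
}

// NA-aware version: check for the sentinel before doing any arithmetic.
// It also returns NA instead of overflowing, as R's integer arithmetic does.
int add_one(int x) {
  if (x == na_integer || x == std::numeric_limits<int>::max()) {
    return na_integer;
  }
  return x + 1;
}
```

Calling add_one_naive(na_integer) silently turns the missing value into -2147483647, while add_one(na_integer) hands the sentinel back unchanged.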
25.4.1.2 Doubles
With doubles, you may be able to get away with ignoring missing values and working with NaNs (not a number). This is because R’s NA is a special type of IEEE 754 floating point NaN. Any comparison with ==, <, or > that involves a NaN (or in C++, NAN) evaluates as FALSE; only != evaluates as TRUE:
evalCpp("NAN == 1")
#> [1] FALSE
evalCpp("NAN < 1")
#> [1] FALSE
evalCpp("NAN > 1")
#> [1] FALSE
evalCpp("NAN == NAN")
#> [1] FALSE
(Here I’m using evalCpp(), which allows you to see the result of running a single C++ expression, making it excellent for this sort of interactive experimentation.)
But be careful when combining them with Boolean values, because a NaN is nonzero and is therefore treated as TRUE:
evalCpp("NAN && TRUE")
#> [1] TRUE
evalCpp("NAN || FALSE")
#> [1] TRUE
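If a missing double might reach a logical test, one option is to check for it explicitly with std::isnan() from the standard library. The sketch below is plain C++ rather than Rcpp. The function names and the 1 / 0 / -1 encoding for true / false / missing are assumptions made for this example.

```cpp
#include <cmath>

// Explicit check: 1 = true, 0 = false, -1 = missing (hypothetical encoding).
// std::isnan() is true for any NaN, which includes R's NA for doubles.
int is_positive(double x) {
  if (std::isnan(x)) return -1;
  return x > 0 ? 1 : 0;
}

// Naive comparison: NaN > 0 is false, so the missing value becomes false.
bool is_positive_naive(double x) {
  return x > 0;
}

// Naive conversion: any nonzero double, including NaN, becomes true.
bool as_bool_naive(double x) {
  return static_cast<bool>(x);
}
```

Given a NaN, the comparison quietly answers false and the conversion quietly answers true. Only the explicit check keeps the missing value visible in a logical context.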
However, in numeric contexts NaNs propagate, so arithmetic that involves a NaN yields a NaN:
evalCpp("NAN + 1")
#> [1] NaN
evalCpp("NAN - 1")
#> [1] NaN
evalCpp("NAN / 1")
#> [1] NaN
evalCpp("NAN * 1")
#> [1] NaN
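Propagation means that a summary over doubles often handles missing values for free, and you only need extra code when you want to skip them. The following is a rough sketch in plain C++ using std::vector<double> in place of NumericVector. The function sum_na() and its na_rm argument are hypothetical names chosen to echo R’s na.rm. Note that std::isnan() cannot tell R’s NA apart from an ordinary NaN, so this version treats both the same way.

```cpp
#include <cmath>
#include <vector>

// Sum a vector of doubles.
// na_rm = false: a single NaN makes the total NaN, via propagation.
// na_rm = true:  NaNs are skipped before they reach the total.
double sum_na(const std::vector<double>& x, bool na_rm) {
  double total = 0;
  for (double value : x) {
    if (na_rm && std::isnan(value)) continue;
    total += value;
  }
  return total;
}
```

For the input {1, NaN, 2}, sum_na(x, false) is NaN and sum_na(x, true) is 3.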
25.4.2 Strings
String is a scalar string class introduced by Rcpp, so it knows how to deal with missing values.
25.4.3 Boolean
While C++’s bool has two possible values (true or false), a logical vector in R has three (TRUE, FALSE, and NA). If you coerce a length 1 logical vector, make sure it doesn’t contain any missing values; otherwise they will be converted to TRUE. An easy fix is to use int instead, as this can represent TRUE, FALSE, and NA.
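As a small illustration of the int approach, here is a sketch in plain C++ (no Rcpp). The constant na_logical is a hypothetical stand-in for NA_LOGICAL, which R stores as the smallest integer, just like NA_INTEGER. Both function names are invented for this example.

```cpp
#include <limits>

// Hypothetical stand-in for NA_LOGICAL: the same sentinel as NA_INTEGER.
const int na_logical = std::numeric_limits<int>::min();

// Lossy: converting to bool maps every nonzero value, including NA, to true.
bool to_bool_lossy(int lgl) {
  return lgl != 0;
}

// Three-valued negation on int: NA stays NA, otherwise flip 0 and 1.
int negate_lgl(int lgl) {
  if (lgl == na_logical) return na_logical;
  return lgl == 0 ? 1 : 0;
}
```

Here to_bool_lossy(na_logical) is true, which matches the logi TRUE seen in the output of scalar_missings(), whereas negate_lgl(na_logical) still returns the missing value.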
25.4.4 Vectors
With vectors, you need to use the missing value specific to the type of vector (NA_REAL, NA_INTEGER, NA_LOGICAL, or NA_STRING):
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List missing_sampler() {
  return List::create(
    NumericVector::create(NA_REAL),
    IntegerVector::create(NA_INTEGER),
    LogicalVector::create(NA_LOGICAL),
    CharacterVector::create(NA_STRING)
  );
}
str(missing_sampler())
#> List of 4
#> $ : num NA
#> $ : int NA
#> $ : logi NA
#> $ : chr NA
25.4.5 Exercises
1. Rewrite any of the functions from the first exercise of Section 25.2.6 to deal with missing values. If na.rm is true, ignore the missing values. If na.rm is false, return a missing value if the input contains any missing values. Some good functions to practice with are min(), max(), range(), mean(), and var().
2. Rewrite cumsum() and diff() so they can handle missing values. Note that these functions have slightly more complicated behaviour.