Performs k-fold cross-validation for a given model training function and prediction function. The hyperparameter to be cross-validated is assumed to be `lambda`. The training and prediction functions are assumed to be able to fit/predict for multiple `lambda` values at the same time.

kfoldcv(
  x,
  y,
  train_fun,
  predict_fun,
  type.measure = "deviance",
  family = "gaussian",
  lambda = NULL,
  train_params = list(),
  predict_params = list(),
  train_row_params = c(),
  predict_row_params = c(),
  nfolds = 10,
  foldid = NULL,
  parallel = FALSE,
  grouped = TRUE,
  keep = FALSE,
  save_cvfits = FALSE
)

Arguments

x

Input matrix of dimension `nobs` by `nvars`; each row is an observation vector.

y

Response variable. Either a vector or a matrix, depending on the type of model.

train_fun

The model training function. This needs to take in an input matrix as `x` and a response variable as `y`.

predict_fun

The prediction function. This needs to take in the output of `train_fun` as `object` and new input matrix as `newx`.

type.measure

Loss function to use for cross-validation. See `availableTypeMeasures()` for possible values for `type.measure`. Note that the package does not check if the user-specified measure is appropriate for the family.

family

Model family; used to determine the correct loss function. One of "gaussian", "binomial", "poisson", "cox", "multinomial", "mgaussian", or a class "family" object.

lambda

Optional user-supplied sequence representing the values of the hyperparameter to be cross-validated.

train_params

Any parameters that should be passed to `train_fun` to fit the model (other than `x` and `y`). Default is the empty list.

predict_params

Any other parameters that should be passed to `predict_fun` to get predictions (other than `object` and `newx`). Default is the empty list.

train_row_params

A vector which is a subset of `names(train_params)`, indicating which parameters have to be subsetted in the CV loop (other than `x` and `y`). Default is `c()`. Other parameters which should probably be included here are "weights" (for observation weights) and "offset".

predict_row_params

A vector which is a subset of `names(predict_params)`, indicating which parameters have to be subsetted in the CV loop (other than `newx`). Default is `c()`. Another parameter which should probably be included here is "newoffset".

nfolds

Number of folds (default is 10). Smallest allowable value is 3.

foldid

An optional vector of values between `1` and `nfolds` (inclusive) identifying which fold each observation is in. If supplied, `nfolds` can be missing.

parallel

If `TRUE`, use parallel `foreach` to fit each fold. A parallel backend must be registered beforehand. Default is `FALSE`.

grouped

This is an experimental argument, with default `TRUE`, and can be ignored by most users. For all models except `family = "cox"`, this refers to computing `nfolds` separate statistics, and then using their mean and estimated standard error to describe the CV curve. If `FALSE`, an error matrix is built up at the observation level from the predictions from the `nfolds` fits, and then summarized (does not apply to `type.measure="auc"`). For the "cox" family, `grouped=TRUE` obtains the CV partial likelihood for the Kth fold by subtraction: by subtracting the log partial likelihood evaluated on the full dataset from that evaluated on the (K-1)/K dataset. This makes more efficient use of risk sets. With `grouped=FALSE` the log partial likelihood is computed only on the Kth fold.

keep

If `keep = TRUE`, a prevalidated array is returned containing fitted values for each observation and each value of lambda. This means these fits are computed with this observation and the rest of its fold omitted. The `foldid` vector is also returned. Default is `keep = FALSE`.

save_cvfits

If `TRUE`, the model fits for each CV fold are returned as a list. Default is `FALSE`.
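
To make `foldid` and the row-parameter arguments more concrete, the following base R sketch builds a balanced fold assignment and shows how a per-observation vector such as `weights` would be subsetted alongside `x` and `y` for a single fold. It illustrates the idea only and is not the package's internal code; the variable names (`train_idx`, `w_train`, and so on) are hypothetical.

```r
set.seed(1)
nobs <- 50
nfolds <- 5

x <- matrix(rnorm(nobs * 10), nrow = nobs)
y <- rnorm(nobs)
weights <- runif(nobs)  # one weight per observation, row-aligned with x

# Balanced random fold assignment: values in 1..nfolds, one per observation.
foldid <- sample(rep(seq_len(nfolds), length.out = nobs))

# For fold k, the model is trained on all rows NOT in fold k
# and evaluated on the rows that are in fold k.
k <- 1
train_idx <- foldid != k

x_train <- x[train_idx, , drop = FALSE]
y_train <- y[train_idx]
w_train <- weights[train_idx]  # row-aligned parameters are cut the same way

x_test <- x[!train_idx, , drop = FALSE]
```

A vector built this way could be passed as `foldid`. A `weights` entry in `train_params` would be named in `train_row_params` so that it is subsetted in the same manner as `x` and `y`.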

Value

An object of class "cvobj".

lambda

The values of lambda used in the fits.

cvm

The mean cross-validated error: a vector of length `length(lambda)`.

cvsd

Estimate of standard error of `cvm`.

cvup

Upper curve = `cvm + cvsd`.

cvlo

Lower curve = `cvm - cvsd`.

lambda.min

Value of `lambda` that gives minimum `cvm`.

lambda.1se

Largest value of `lambda` such that the error is within 1 standard error of the minimum.

index

A one-column matrix with the indices of `lambda.min` and `lambda.1se` in the sequence of coefficients, fits etc.

name

A text string indicating the loss function used (for plotting purposes).

fit.preval

If `keep=TRUE`, this is the array of prevalidated fits. Some entries can be `NA`, if that and subsequent values of `lambda` are not reached for that fold.

foldid

If `keep=TRUE`, the fold assignments used.

overallfit

Model fit for the entire dataset.

cvfitlist

If `save_cvfits=TRUE`, a list containing the model fits for each CV fold.
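
As an illustration of how `cvup`, `cvlo`, `lambda.min` and `lambda.1se` relate to `cvm` and `cvsd`, the base R sketch below applies the definitions above to made-up numbers. The data and the variable names are hypothetical, and the package's own computation may differ in details such as tie-breaking or `NA` handling.

```r
# Hypothetical CV results for five lambda values (largest first).
lambda <- c(1.0, 0.5, 0.25, 0.1, 0.05)
cvm    <- c(2.0, 1.15, 1.0, 1.1, 1.4)
cvsd   <- c(0.3, 0.25, 0.2, 0.2, 0.25)

# Upper and lower curves around the mean CV error.
cvup <- cvm + cvsd
cvlo <- cvm - cvsd

# lambda.min: the lambda with the smallest mean CV error.
idx_min <- which.min(cvm)
lambda_min <- lambda[idx_min]

# lambda.1se: the largest lambda whose mean error is at most
# (minimum error + standard error at the minimum).
threshold <- cvm[idx_min] + cvsd[idx_min]
lambda_1se <- max(lambda[cvm <= threshold])
idx_1se <- match(lambda_1se, lambda)
```

With these numbers the minimum error is at `lambda = 0.25`, while `lambda = 0.5` is the largest value whose error stays within one standard error of that minimum.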

Details

The model training function is assumed to take in the data matrix as `x`, the response as `y`, and the hyperparameter to be cross-validated as `lambda`. It is assumed that in its returned output, the hyperparameter values actually used are stored as `lambda`. The prediction function is assumed to take in the new data matrix as `newx`, and a `lambda` sequence as `s`.
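
The interface described above can be sketched with a pair of user-written functions. The example below is a minimal base R ridge regression (no intercept) that fits several `lambda` values at once, stores them as `lambda` in its output, and predicts for the values requested through `s`. The names `ridge_train` and `ridge_predict`, the `beta` component, and the fallback `lambda` sequence are all hypothetical; this is one possible shape, not the package's reference implementation.

```r
# Hypothetical training function: closed-form ridge regression for a
# vector of lambda values. Takes x, y and lambda, as assumed above.
ridge_train <- function(x, y, lambda = NULL, ...) {
  if (is.null(lambda)) lambda <- c(1, 0.1, 0.01)  # arbitrary fallback
  p <- ncol(x)
  xtx <- crossprod(x)      # t(x) %*% x
  xty <- crossprod(x, y)   # t(x) %*% y
  # One column of coefficients per lambda value (p rows, length(lambda) cols).
  beta <- matrix(
    vapply(lambda,
           function(l) as.numeric(solve(xtx + l * diag(p), xty)),
           numeric(p)),
    nrow = p
  )
  # The lambda values actually used are stored as `lambda`.
  list(beta = beta, lambda = lambda)
}

# Hypothetical prediction function: takes the fitted object, a new input
# matrix `newx`, and the lambda values to predict for as `s`.
ridge_predict <- function(object, newx, s = object$lambda, ...) {
  idx <- match(s, object$lambda)
  stopifnot(!anyNA(idx))   # only lambda values that were actually fitted
  newx %*% object$beta[, idx, drop = FALSE]
}

set.seed(1)
x <- matrix(rnorm(500), nrow = 50)
y <- rnorm(50)

fit <- ridge_train(x, y, lambda = c(1, 0.1))
pred <- ridge_predict(fit, newx = x[1:5, ], s = fit$lambda)
dim(pred)  # 5 rows (observations) by 2 columns (lambda values)
```

Functions of this shape could then be supplied as `train_fun = ridge_train` and `predict_fun = ridge_predict`. Any further conventions, for example the exact shape expected of the prediction output, should be checked against the package itself.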

Examples

set.seed(1)
x <- matrix(rnorm(500), nrow = 50)
y <- rnorm(50)
cv_fit <- kfoldcv(x, y, train_fun = glmnet::glmnet, predict_fun = predict)