kfoldcv.Rd
Does k-fold cross-validation for a given model training function and prediction function. The hyperparameter to be cross-validated is assumed to be `lambda`. The training and prediction functions are assumed to be able to fit/predict for multiple `lambda` values at the same time.
kfoldcv( x, y, train_fun, predict_fun, type.measure = "deviance", family = "gaussian", lambda = NULL, train_params = list(), predict_params = list(), train_row_params = c(), predict_row_params = c(), nfolds = 10, foldid = NULL, parallel = FALSE, grouped = TRUE, keep = FALSE, save_cvfits = FALSE )
Argument | Description |
---|---|
x | Input matrix of dimension `nobs` by `nvars`; each row is an observation vector. |
y | Response variable. Either a vector or a matrix, depending on the type of model. |
train_fun | The model training function. This needs to take in an input matrix as `x` and a response variable as `y`. |
predict_fun | The prediction function. This needs to take in the output of `train_fun` as `object` and new input matrix as `newx`. |
type.measure | Loss function to use for cross-validation. See `availableTypeMeasures()` for possible values for `type.measure`. Note that the package does not check if the user-specified measure is appropriate for the family. |
family | Model family; used to determine the correct loss function. One of "gaussian", "binomial", "poisson", "cox", "multinomial", "mgaussian", or a class "family" object. |
lambda | Optional user-supplied sequence representing the values of the hyperparameter to be cross-validated. |
train_params | Any parameters that should be passed to `train_fun` to fit the model (other than `x` and `y`). Default is the empty list. |
predict_params | Any other parameters that should be passed to `predict_fun` to get predictions (other than `object` and `newx`). Default is the empty list. |
train_row_params | A vector which is a subset of `names(train_params)`, indicating which parameters have to be subsetted in the CV loop (other than `x` and `y`). Default is `c()`. Other parameters which should probably be included here are "weights" (for observation weights) and "offset". |
predict_row_params | A vector which is a subset of `names(predict_params)`, indicating which parameters have to be subsetted in the CV loop (other than `newx`). Default is `c()`. Other parameters which should probably be included here are "newoffset". |
nfolds | Number of folds (default is 10). Smallest allowable value is 3. |
foldid | An optional vector of values between `1` and `nfolds` (inclusive) identifying which fold each observation is in. If supplied, `nfolds` can be missing. |
parallel | If `TRUE`, use parallel `foreach` to fit each fold. A parallel backend must be registered beforehand. Default is `FALSE`. |
grouped | This is an experimental argument, with default `TRUE`, and can be ignored by most users. For all models except `family = "cox"`, `grouped = TRUE` computes `nfolds` separate statistics and then uses their mean and estimated standard error to describe the CV curve. If `FALSE`, an error matrix is built up at the observation level from the predictions from the `nfolds` fits, and then summarized (does not apply to `type.measure = "auc"`). For the "cox" family, `grouped = TRUE` obtains the CV partial likelihood for the Kth fold by subtraction: by subtracting the log partial likelihood evaluated on the full dataset from that evaluated on the (K-1)/K dataset. This makes more efficient use of risk sets. With `grouped = FALSE` the log partial likelihood is computed only on the Kth fold. |
keep | If `keep = TRUE`, a prevalidated array is returned containing fitted values for each observation and each value of lambda. This means these fits are computed with this observation and the rest of its fold omitted. The `foldid` vector is also returned. Default is `keep = FALSE`. |
save_cvfits | If `TRUE`, the model fits for each CV fold are returned as a list. Default is `FALSE`. |
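
The `foldid` and row-parameter arguments above can be illustrated with a short base-R sketch. It builds a balanced fold assignment and shows what "subsetting in the CV loop" could mean for a per-observation argument such as `weights`. The variable names and the loop are illustrative assumptions, not the package's internal code.

```r
set.seed(1)
nobs <- 20
nfolds <- 5

# Balanced fold assignment: each value in 1..nfolds appears nobs / nfolds times.
foldid <- sample(rep(seq_len(nfolds), length.out = nobs))

# A per-observation argument (hypothetical weights) has one entry per row of x,
# so it has to be subsetted alongside x and y in every fold.
weights <- runif(nobs)

train_sizes <- integer(nfolds)
for (k in seq_len(nfolds)) {
  held_out <- foldid == k
  w_train <- weights[!held_out]  # weights for the rows used to fit fold k
  train_sizes[k] <- length(w_train)
}
train_sizes
```

A vector built this way could be supplied as `foldid`, and a per-observation entry of `train_params` such as `weights` would then be named in `train_row_params`.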
An object of class "cvobj".
The values of lambda used in the fits.
The mean cross-validated error: a vector of length `length(lambda)`.
Estimate of standard error of `cvm`.
Upper curve = `cvm + cvsd`.
Lower curve = `cvm - cvsd`.
Value of `lambda` that gives minimum `cvm`.
Largest value of `lambda` such that the error is within 1 standard error of the minimum.
A one-column matrix with the indices of `lambda.min` and `lambda.1se` in the sequence of coefficients, fits etc.
A text string indicating the loss function used (for plotting purposes).
If `keep=TRUE`, this is the array of prevalidated fits. Some entries can be `NA`, if that and subsequent values of `lambda` are not reached for that fold.
If `keep=TRUE`, the fold assignments used.
Model fit for the entire dataset.
If `save_cvfits=TRUE`, a list containing the model fits for each CV fold.
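The relationship between the error curve and the two selected `lambda` values can be sketched in base R. The numbers below are made up for illustration, and the sketch assumes a loss where smaller is better; it is not the package's own selection code.

```r
# Hypothetical CV results for five lambda values (largest first).
lambda <- c(1.0, 0.5, 0.25, 0.1, 0.05)
cvm    <- c(2.0, 1.3, 1.05, 1.00, 1.02)
cvsd   <- c(0.20, 0.15, 0.10, 0.10, 0.10)

# Upper and lower curves around the mean CV error.
cvup <- cvm + cvsd
cvlo <- cvm - cvsd

# lambda.min: the lambda with the smallest mean CV error.
idx_min <- which.min(cvm)
lambda_min <- lambda[idx_min]

# lambda.1se: the largest lambda whose mean error is within one
# standard error of the minimum.
threshold <- cvm[idx_min] + cvsd[idx_min]
lambda_1se <- max(lambda[cvm <= threshold])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```

Here the minimum error occurs at `lambda = 0.1`, while the one-standard-error rule picks the larger, more heavily regularized value `lambda = 0.25`.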
The model training function is assumed to take in the data matrix as `x`, the response as `y`, and the hyperparameter to be cross-validated as `lambda`. It is assumed that in its returned output, the hyperparameter values actually used are stored as `lambda`. The prediction function is assumed to take in the new data matrix as `newx`, and a `lambda` sequence as `s`.
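As a minimal sketch of a function pair that follows these conventions, the base-R code below fits closed-form ridge regression for several `lambda` values at once. The names `ridge_train` and `ridge_predict` are hypothetical, the model omits an intercept for brevity, and the prediction function assumes that `s` is a subset of the fitted `lambda` values.

```r
# Hypothetical training function: takes x, y and lambda, and stores the
# lambda values actually used in the returned object as `lambda`.
ridge_train <- function(x, y, lambda, ...) {
  p <- ncol(x)
  xtx <- crossprod(x)
  xty <- crossprod(x, y)
  # One column of coefficients per lambda value.
  beta <- matrix(
    vapply(lambda,
           function(l) as.numeric(solve(xtx + l * diag(p), xty)),
           numeric(p)),
    nrow = p
  )
  list(beta = beta, lambda = lambda)
}

# Hypothetical prediction function: takes the fitted object, newx and s,
# and returns an nrow(newx) by length(s) matrix of predictions.
ridge_predict <- function(object, newx, s = object$lambda, ...) {
  idx <- match(s, object$lambda)  # assumes every s was fitted
  newx %*% object$beta[, idx, drop = FALSE]
}

set.seed(1)
x <- matrix(rnorm(50 * 3), nrow = 50)
y <- as.numeric(x %*% c(1, -2, 0.5) + rnorm(50, sd = 0.1))

fit <- ridge_train(x, y, lambda = c(10, 1, 0.1))
preds <- ridge_predict(fit, x, s = fit$lambda)
dim(preds)
```

A pair like this could then be passed to `kfoldcv()` as `train_fun = ridge_train` and `predict_fun = ridge_predict`, together with a `lambda` sequence; whether further arguments are needed depends on the model being wrapped.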