What is stacking?
Stacking is one of the three widely used ensemble methods in Machine Learning and its applications. The overall idea of stacking is to train several models, usually with different algorithm types (aka base-learners), on the train data, and then rather than picking the best model, all the models are aggregated/fronted using another model (meta learner), to make the final prediction. The inputs for the meta-learner is the prediction outputs of the base-learners.
How to Train?
Training a stacking model is a bit tricky, but is not as hard as it sounds. All it requires is some similar steps as k-fold cross-validation. First of all, divide the original data set into two sets: Train set and Test set. We won't be even touching the Test Set during our training process of the Stacking model. Now we need to divide the Train set into k-number (say 10) of folds. If the original dataset contains N data points, then each fold will contain N/k number of data points. (it is not mandatory to have equal size folds.)
Keep one of the folds aside, and train the base models, using the remaining folds. The kept-aside fold will be treated as the testing data for this step.
Then, predict the valued for the remaining fold (10th fold), using all the M models trained. So this will result in M number of predictions for each data point in the 10th fold. Now we have N/10 data points sets (prediction sets), each with M number of fields (predictions coming from the M number of models). i.e: matrix with N/10 * M.
Now, iterate the above process by changing the kept-out fold (from 1 to 10). At the end of all the iterations, we would be having N number of prediction results sets, which corresponds to each data point in the original training set, along with the actual value of the field we predict.
This will be the input data set to our meta-learner. Now we can train the meta learner, using any suitable native algorithm, by sending each prediction as an input field and the original value as the output field.
Once all the base-learners and meta-learner are trained, prediction follows the same idea as the training, except the k-folds. Simply, for a given input data point, all we need to do is to pass it through the M base-learners and get M number of predictions, and send those M predictions through the meta-learner as inputs, as in Figure 1.