Incremental training with XGBoost
Install dependencies
pip install scikit-learn xgboost
Save your model after training on the first batch. Then, on each subsequent run, pass the saved model's file path to xgb.train via its xgb_model parameter, and XGBoost will continue boosting from the existing trees instead of starting from scratch.
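Distilled to its essentials, the pattern looks like this (a minimal sketch with random toy data; the batch shapes and file name are placeholders for your own):

import numpy as np
import xgboost as xgb

# toy batches; in practice these are successive chunks of your data
X1, y1 = np.random.rand(100, 4), np.random.rand(100)
X2, y2 = np.random.rand(100, 4), np.random.rand(100)

params = {'objective': 'reg:squarederror'}
booster = xgb.train(params, xgb.DMatrix(X1, label=y1), num_boost_round=30)
booster.save_model('model_1.model')

# on a later run: load the saved trees and keep boosting on the new batch
booster = xgb.train(params, xgb.DMatrix(X2, label=y2),
                    num_boost_round=30, xgb_model='model_1.model')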
The experiment below verifies that this actually works. First, split the Boston housing dataset into training and test sets, then split the training set in half. Fit a model on the first half and record its test score as a benchmark. Then fit two models on the second half; one of them receives the additional xgb_model parameter pointing at the saved benchmark model. If the extra parameter made no difference, we would expect the two models' scores to be similar. Fortunately, the model that continues from the saved benchmark performs much better than the one trained from scratch.
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
# load the Boston housing data (deprecated in scikit-learn 1.0 and removed
# in 1.2; requires scikit-learn < 1.2, or substitute any regression dataset)
X, y = load_boston(return_X_y=True)
# split data into training and testing sets
# then split training set in half
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = train_test_split(
X_train, y_train, test_size=0.5, random_state=0
)
xg_train_1 = xgb.DMatrix(X_train_1, label=y_train_1)
xg_train_2 = xgb.DMatrix(X_train_2, label=y_train_2)
xg_test = xgb.DMatrix(X_test, label=y_test)
params = {'objective': 'reg:squarederror', 'verbosity': 0}  # XGBoost's key is 'verbosity' (0 = silent), not 'verbose'
model_1 = xgb.train(params, xg_train_1, 30)
model_1.save_model('model_1.model')
# ================= train two versions of the model =====================#
model_2_v1 = xgb.train(params, xg_train_2, 30)
model_2_v2 = xgb.train(params, xg_train_2, 30, xgb_model='model_1.model')
print(mean_squared_error(y_test, model_1.predict(xg_test)))    # benchmark
print(mean_squared_error(y_test, model_2_v1.predict(xg_test))) # "before"
print(mean_squared_error(y_test, model_2_v2.predict(xg_test))) # "after"
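The same mechanism extends to a whole stream of batches: xgb_model also accepts a Booster object, so you can keep a running model in memory and feed it back into xgb.train on every batch with no intermediate file. A minimal sketch with synthetic data (the batch sizes, number of batches, and rounds per batch are illustrative; num_boosted_rounds needs a reasonably recent xgboost):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
w = rng.normal(size=13)  # fixed linear signal shared by all batches
params = {'objective': 'reg:squarederror', 'verbosity': 0}

booster = None  # None on the first pass means "train from scratch"
for _ in range(5):  # five synthetic batches standing in for a data stream
    X_batch = rng.normal(size=(200, 13))
    y_batch = X_batch @ w + rng.normal(scale=0.1, size=200)
    dtrain = xgb.DMatrix(X_batch, label=y_batch)
    booster = xgb.train(params, dtrain, num_boost_round=10, xgb_model=booster)

print(booster.num_boosted_rounds())  # 50: each batch adds 10 more trees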