Extreme Gradient Boosting (XGBoost) is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm.
Shortly after its development and initial release, XGBoost became the go-to method and often the key component in winning solutions for a range of problems in machine learning competitions.
Regression predictive modeling problems involve predicting a numerical value such as a dollar amount or a height. XGBoost can be used directly for regression predictive modeling.
In this tutorial, you will discover how to develop and evaluate XGBoost regression models in Python.
After completing this tutorial, you will know:
- XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
- How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
- How to fit a final model and use it to make a prediction on new data.
Let’s get started.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Extreme Gradient Boosting
- XGBoost Regression API
- XGBoost Regression Example
Extreme Gradient Boosting
Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.
Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.
Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. This gives the technique its name, “gradient boosting,” as the loss gradient is minimized as the model is fit, much like a neural network.
For more on gradient boosting, see the tutorial:
Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.
It was initially developed by Tianqi Chen and was described by Chen and Carlos Guestrin in their 2016 paper titled “XGBoost: A Scalable Tree Boosting System.”
It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.
The two main reasons to use XGBoost are execution speed and model performance.
XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.
Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions used XGBoost. […] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10.
— XGBoost: A Scalable Tree Boosting System, 2016.
Now that we are familiar with what XGBoost is and why it is important, let’s take a closer look at how we can use it in our regression predictive modeling projects.
XGBoost Regression API
XGBoost can be installed as a standalone library and an XGBoost model can be developed using the scikit-learn API.
The first step is to install the XGBoost library if it is not already installed. This can be achieved using the pip python package manager on most platforms; for example:
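sudo pip install xgboost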
You can then confirm that the XGBoost library was installed correctly and can be used by running the following script.
# check xgboost version
import xgboost
print(xgboost.__version__)
Running the script will print the version of the XGBoost library you have installed.
Your version should be the same or higher. If not, you must upgrade your version of the XGBoost library.
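For example, on most platforms you can upgrade to the latest version using pip:

sudo pip install --upgrade xgboost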
It is possible that you may have problems with the latest version of the library. It is not your fault.
Sometimes, the most recent version of the library imposes additional requirements or may be less stable.
If you do have errors when trying to run the above script, I recommend downgrading to version 1.0.1 (or lower). This can be achieved by specifying the version to install to the pip command, as follows:
sudo pip install xgboost==1.0.1
If you require specific instructions for your development environment, see the tutorial:
The XGBoost library has its own custom API, although we will use it via the scikit-learn wrapper classes: XGBRegressor and XGBClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.
An XGBoost regression model can be defined by creating an instance of the XGBRegressor class; for example:
...
# create an xgboost regression model
model = XGBRegressor()
You can specify hyperparameter values to the class constructor to configure the model.
Perhaps the most commonly configured hyperparameters are the following:
- n_estimators: The number of trees in the ensemble, often increased until no further improvements are seen.
- max_depth: The maximum depth of each tree, often values are between 1 and 10.
- eta: The learning rate used to weight the contribution of each tree, often set to small values such as 0.3, 0.1, 0.01, or smaller.
- subsample: The fraction of samples (rows) used to fit each tree, set to a value between 0 and 1, often 1.0 to use all samples.
- colsample_bytree: The fraction of features (columns) used in each tree, set to a value between 0 and 1, often 1.0 to use all features.
For example:
...
# create an xgboost regression model
model = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.7, colsample_bytree=0.8)
Good hyperparameter values can be found by trial and error for a given dataset, or systematic experimentation such as using a grid search across a range of values.
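As a quick illustration of the grid search approach, the sketch below uses scikit-learn’s GridSearchCV with a small, illustrative grid of values on a synthetic regression dataset; the specific values and the synthetic data are examples only, not recommendations for your problem.

# illustrative sketch: grid search a few xgboost hyperparameters (values are examples only)
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
# synthetic data for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=1)
# grid of candidate values; learning_rate is the scikit-learn name for eta
grid = {
    'n_estimators': [100, 500],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
}
search = GridSearchCV(XGBRegressor(), grid, scoring='neg_mean_absolute_error', cv=3, n_jobs=-1)
result = search.fit(X, y)
# report the best configuration found and its score
print(result.best_params_)
print(result.best_score_)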
Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it may produce a slightly different model.
When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.
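As a sketch of the second option, the example below fits several XGBoost models with different random seeds (using a subsample value below 1.0 so that the seed actually matters) and averages their predictions for a single row; the synthetic data and chosen values are for illustration only.

# illustrative sketch: average the predictions of several final models fit with different seeds
from numpy import mean
from sklearn.datasets import make_regression
from xgboost import XGBRegressor
# synthetic data for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=1)
# a single new row to predict (here, the first row of the data)
new_row = X[:1]
predictions = []
for seed in range(5):
    # subsample < 1.0 makes the fitted model depend on the random seed
    model = XGBRegressor(subsample=0.7, random_state=seed)
    model.fit(X, y)
    predictions.append(model.predict(new_row)[0])
# average the predictions from the individual models
print('Averaged prediction: %.3f' % mean(predictions))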
Let’s take a look at how to develop an XGBoost ensemble for regression.
XGBoost Regression Example
In this section, we will look at how we might develop an XGBoost model for a standard regression predictive modeling dataset.
First, let’s introduce a standard regression dataset.
We will use the housing dataset.
The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.
Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.
The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.
No need to download the dataset; we will download it automatically as part of our worked examples.
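Before fitting XGBoost, you may want to confirm the naive baseline yourself. The sketch below evaluates a model that always predicts the mean of the training target on the same test harness, using scikit-learn’s DummyRegressor; this class is not used elsewhere in this tutorial, and your exact score may differ slightly from the 6.6 reported above.

# sketch: naive baseline (predict the mean) on the housing dataset with the same test harness
from numpy import absolute
from pandas import read_csv
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
data = read_csv(url, header=None).values
X, y = data[:, :-1], data[:, -1]
# define the naive model and the evaluation procedure
model = DummyRegressor(strategy='mean')
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate and report the baseline MAE
scores = absolute(cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1))
print('Baseline MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))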
The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.
# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())
Running the example confirms the 506 rows of data with 13 input variables and a single numeric target variable (14 columns in total). We can also see that all input variables are numeric.
(506, 14)
         0     1     2  3      4      5  ...  8      9     10      11    12    13
0  0.00632  18.0  2.31  0  0.538  6.575  ...  1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07  0  0.469  6.421  ...  2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07  0  0.469  7.185  ...  2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18  0  0.458  6.998  ...  3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18  0  0.458  7.147  ...  3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]
Next, let’s evaluate a regression XGBoost model with default hyperparameters on the problem.
First, we can split the loaded dataset into input and output columns for training and evaluating a predictive model.
...
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
Next, we can create an instance of the model with a default configuration.
...
# define model
model = XGBRegressor()
We will evaluate the model using the best practice of repeated k-fold cross-validation with 3 repeats and 10 folds.
This can be achieved by using the RepeatedKFold class to configure the evaluation procedure and calling the cross_val_score() function to evaluate the model using the procedure and collect the scores.
Model performance will be evaluated using mean absolute error (MAE). Note, MAE is made negative in the scikit-learn library so that it can be maximized. As such, we can ignore the sign and assume all errors are positive.
...
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
Once evaluated, we can report the estimated performance of the model when used to make predictions on new data for this problem.
In this case, because the scores were made negative, we can use the absolute() NumPy function to make the scores positive.
We then report a statistical summary of the performance using the mean and standard deviation of the distribution of scores, another good practice.
...
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
Tying this together, the complete example of evaluating an XGBoost model on the housing regression predictive modeling problem is listed below.
# evaluate an xgboost regression model on the housing dataset
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))
Running the example evaluates the XGBoost Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that the model achieved a MAE of about 2.1.
This is a good score, better than the baseline of about 6.6, meaning the model has skill, and it is close to the best score of about 1.9.
We may decide to use the XGBoost Regression model as our final model and make predictions on new data.
This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.
For example:
...
# make a prediction
yhat = model.predict(new_data)
We can demonstrate this with a complete example, listed below.
# fit a final xgboost model on the housing dataset and make a prediction
from numpy import asarray
from pandas import read_csv
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split dataset into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
new_data = asarray([row])
# make a prediction
yhat = model.predict(new_data)
# summarize prediction
print('Predicted: %.3f' % yhat[0])
Running the example fits the model and makes a prediction for the new row of data.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.
In this case, we can see that the model predicted a value of about 24.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Tutorials
Papers
- XGBoost: A Scalable Tree Boosting System, 2016. https://arxiv.org/abs/1603.02754
APIs
Summary
In this tutorial, you discovered how to develop and evaluate XGBoost regression models in Python.
Specifically, you learned:
- XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
- How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
- How to fit a final model and use it to make a prediction on new data.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.