Data: Titanic Passenger Survival Data Set
We use the titanic dataset for binary classification of Survived. First,
we store the training data in a data frame and remove all rows that
contain NAs:
# Store the training data:
df_train = na.omit(titanic::titanic_train)
str(df_train)
#> 'data.frame': 714 obs. of 12 variables:
#> $ PassengerId: int 1 2 3 4 5 7 8 9 10 11 ...
#> $ Survived : int 0 1 1 1 0 0 0 1 1 1 ...
#> $ Pclass : int 3 1 3 1 3 1 3 3 2 3 ...
#> $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
#> $ Sex : chr "male" "female" "female" "female" ...
#> $ Age : num 22 38 26 35 35 54 2 27 14 4 ...
#> $ SibSp : int 1 1 0 1 0 0 3 0 1 1 ...
#> $ Parch : int 0 0 0 0 0 0 1 2 0 1 ...
#> $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
#> $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
#> $ Cabin : chr "" "C85" "" "C123" ...
#> $ Embarked : chr "S" "C" "S" "S" ...
#> - attr(*, "na.action")= 'omit' Named int [1:177] 6 18 20 27 29 30 32 33 37 43 ...
#> ..- attr(*, "names")= chr [1:177] "6" "18" "20" "27" ...
In the next step we transform the response to a factor with more intuitive levels:
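The conversion code is not shown here. A minimal sketch of such a transformation, assuming the labels "no" and "yes" (hypothetical; the level names actually used are not given) and a small toy vector standing in for df_train$Survived:

```r
# Toy stand-in for df_train$Survived (0 = did not survive, 1 = survived).
survived_int = c(0, 1, 1, 1, 0, 0)

# Map the integer codes to a factor; the labels "no"/"yes" are an assumption.
survived_fct = factor(survived_int, levels = c(0, 1), labels = c("no", "yes"))

# Count the observations per level.
table(survived_fct)
```

Applied to the real data, the same call would be assigned back to df_train$Survived before the model is initialized.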
Initializing Model
Because of the R6 API, we first create a new object of the Compboost
class, which receives the data, the name of the target as a character
string, and the loss to use. Note that the loss must be given as an
initialized loss object:
cboost = Compboost$new(data = df_train, target = "Survived", oob_fraction = 0.3)
Using an initialized object for the loss makes it possible to use a loss initialized with a custom offset.
Adding Base-Learner
New base-learners are likewise added by giving the feature name as a character string. The second argument is an identifier for the factory, which is needed because we can define multiple base-learners on the same source.
Numerical Features
For instance, we can define a spline and a linear base-learner of the same feature:
# Spline base-learner of age:
cboost$addBaselearner("Age", "spline", BaselearnerPSpline)
# Linear base-learner of age (degree = 1 with intercept is default):
cboost$addBaselearner("Age", "linear", BaselearnerPolynomial)
Additional arguments can be specified after naming the base-learner:
# Spline base-learner of fare:
cboost$addBaselearner("Fare", "spline", BaselearnerPSpline, degree = 2,
  n_knots = 14, penalty = 10, differences = 2)
For the base-learner documentation, see the functionality section of the project page.
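To illustrate how degree and n_knots shape a spline base-learner, the following sketch builds a B-spline basis with the settings used for Fare. It relies on splineDesign() from R's splines package, not on the package's internal implementation; the knot placement (equidistant inner knots, repeated boundary knots) and the toy feature values are assumptions.

```r
library(splines)

# Settings taken from the Fare base-learner above.
degree = 2
n_knots = 14

# Toy feature values standing in for Fare (assumption).
x = seq(0, 100, length.out = 50)

# Equidistant inner knots strictly inside the range of x (assumed placement).
grid = seq(min(x), max(x), length.out = n_knots + 2)
inner = grid[-c(1, n_knots + 2)]

# Repeat each boundary knot (degree + 1) times so the basis covers the full range.
knots = c(rep(min(x), degree + 1), inner, rep(max(x), degree + 1))

# One column per B-spline basis function, one row per observation.
basis = splineDesign(knots, x, ord = degree + 1)
dim(basis)
```

The basis has n_knots + degree + 1 = 17 columns, which is consistent with the 17 coefficients that getCoef() reports for Fare_spline further below.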
Categorical Features
When adding categorical features we use a dummy coded representation with a ridge penalty:
cboost$addBaselearner("Sex", "categorical", BaselearnerCategoricalRidge)
Finally, we can check which factories are registered:
cboost$getBaselearnerNames()
#> [1] "Age_spline" "Age_linear" "Fare_spline" "Sex_categorical"
Define Logger
Time logger
This logger records the elapsed time. The time unit can be one of
microseconds, seconds or minutes.
When used as a stopper, the logger stops the training once max_time is
reached. Here we do not use it as a stopper:
cboost$addLogger(logger = LoggerTime, use_as_stopper = FALSE, logger_id = "time",
  max_time = 0, time_unit = "microseconds")
Train Model and Access Elements
cboost$train(2000, trace = 250)
#> 1/2000 risk = 0.68 oob_risk = 0.66 time = 0
#> 250/2000 risk = 0.5 oob_risk = 0.5 time = 26400
#> 500/2000 risk = 0.48 oob_risk = 0.48 time = 55487
#> 750/2000 risk = 0.47 oob_risk = 0.48 time = 87644
#> 1000/2000 risk = 0.47 oob_risk = 0.48 time = 125157
#> 1250/2000 risk = 0.47 oob_risk = 0.48 time = 169462
#> 1500/2000 risk = 0.47 oob_risk = 0.48 time = 209153
#> 1750/2000 risk = 0.47 oob_risk = 0.48 time = 250716
#> 2000/2000 risk = 0.46 oob_risk = 0.48 time = 294164
#>
#>
#> Train 2000 iterations in 0 Seconds.
#> Final risk based on the train set: 0.46
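Because time_unit was set to "microseconds", the time column in the trace above is in microseconds. A small sketch that converts the logged values (copied from the trace) to seconds; the per-iteration average is only a rough figure derived from these numbers:

```r
# Time values copied from the trace above, in microseconds.
time_us = c(0, 26400, 55487, 87644, 125157, 169462, 209153, 250716, 294164)

# Convert to seconds.
time_s = time_us / 1e6
round(time_s, 3)

# Rough average time per iteration over the 2000 iterations, in microseconds.
us_per_iter = time_us[length(time_us)] / 2000
us_per_iter
```

The final value of roughly 0.29 seconds would explain why the summary line reports 0 seconds, assuming that figure is truncated to whole seconds.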
cboost
#>
#>
#> Component-Wise Gradient Boosting
#>
#> Target variable: Survived
#> Number of base-learners: 4
#> Learning rate: 0.05
#> Iterations: 2000
#>
#> Offset: 0.3392
#>
#> LossBinomial: L(y,x) = log(1 + exp(-2yf(x))
Objects of the Compboost class have member functions such as
getCoef(), getInbagRisk(), and predict() to access the results:
str(cboost$getCoef())
#> List of 4
#> $ Age_spline : num [1:24, 1] -4.168 -1.131 -1.397 1.081 0.462 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Fare_spline : num [1:17, 1] 1.015 0.322 -0.527 -1.705 -1.41 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Sex_categorical: num [1:2, 1] 0.89 -1.39
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:2] "male" "female"
#> .. ..$ : NULL
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerCategoricalRidge"
#> $ offset : num 0.339
str(cboost$getInbagRisk())
#> num [1:2001] 0.679 0.676 0.672 0.669 0.666 ...
str(cboost$predict())
#> num [1:500, 1] 2.104 -2.062 -0.467 -2.088 1.312 ...
To obtain a vector of the selected base-learners, use
getSelectedBaselearner():
table(cboost$getSelectedBaselearner())
#>
#>      Age_spline     Fare_spline Sex_categorical
#>            1145             521             334
We can also access predictions directly from the response objects
cboost$response and cboost$response_oob. Note
that $response_oob was created automatically because an
oob_fraction was defined in the constructor:
oob_label = cboost$response_oob$getResponse()
oob_pred = cboost$response_oob$getPredictionResponse()
table(true_label = oob_label, predicted = oob_pred)
#>           predicted
#> true_label  -1   1
#>         -1  55  27
#>         1   17 115
Retrain the Model
To continue training, or to set the whole model to another iteration,
simply call train() again:
cboost$train(3000)
#>
#> You have already trained 2000 iterations.
#> Train 1000 additional iterations.
#>
#> 2025/3000 risk = 0.46 oob_risk = 0.48 time = 298880
#> 2100/3000 risk = 0.46 oob_risk = 0.48 time = 312730
#> 2175/3000 risk = 0.46 oob_risk = 0.48 time = 327079
#> 2250/3000 risk = 0.46 oob_risk = 0.48 time = 342173
#> 2325/3000 risk = 0.46 oob_risk = 0.48 time = 356363
#> 2400/3000 risk = 0.46 oob_risk = 0.48 time = 371352
#> 2475/3000 risk = 0.46 oob_risk = 0.48 time = 385944
#> 2550/3000 risk = 0.46 oob_risk = 0.48 time = 400836
#> 2625/3000 risk = 0.46 oob_risk = 0.48 time = 415903
#> 2700/3000 risk = 0.46 oob_risk = 0.49 time = 431397
#> 2775/3000 risk = 0.46 oob_risk = 0.49 time = 446305
#> 2850/3000 risk = 0.46 oob_risk = 0.49 time = 462330
#> 2925/3000 risk = 0.46 oob_risk = 0.49 time = 478116
#> 3000/3000 risk = 0.46 oob_risk = 0.49 time = 494244
str(cboost$getCoef())
#> List of 4
#> $ Age_spline : num [1:24, 1] -5.693 -0.779 -1.667 1.253 0.391 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Fare_spline : num [1:17, 1] 0.973 0.344 -0.504 -1.826 -1.444 ...
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerPSpline"
#> $ Sex_categorical: num [1:2, 1] 0.905 -1.409
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:2] "male" "female"
#> .. ..$ : NULL
#> ..- attr(*, "blclass")= chr "Rcpp_BaselearnerCategoricalRidge"
#> $ offset : num 0.339
str(cboost$getInbagRisk())
#> num [1:3001] 0.679 0.676 0.672 0.669 0.666 ...
table(cboost$getSelectedBaselearner())
#>
#>      Age_spline     Fare_spline Sex_categorical
#>            1931             695             374
Next steps
- Have a look at the visualization capabilities of the package.
- See how other loss functions affect the model training.
