1 Introduction
Boosting algorithms [14, 4, 5, 2, 15, 7, 13, 6] have become very successful in machine learning.^1 This study revisits logitboost [7] under the framework of adaptive base class boost (abc-boost) in [10], for multiclass classification.

^1 The idea of abc-logitboost was included in an unfunded grant proposal submitted in early December 2008.

We denote a training dataset by $\{y_i, \mathbf{x}_i\}_{i=1}^{N}$, where $N$ is the number of feature vectors (samples), $\mathbf{x}_i$ is the $i$-th feature vector, and $y_i \in \{0, 1, 2, ..., K-1\}$ is the $i$-th class label, where $K \ge 3$ in multiclass classification.

Both logitboost [7] and mart (multiple additive regression trees) [6] algorithms can be viewed as generalizations of the logistic regression model, which assumes the class probabilities $p_{i,k}$ to be

(1)   $p_{i,k} = \Pr(y_i = k \mid \mathbf{x}_i) = \frac{e^{F_{i,k}(\mathbf{x}_i)}}{\sum_{s=0}^{K-1} e^{F_{i,s}(\mathbf{x}_i)}}.$
While traditional logistic regression assumes $F_{i,k}(\mathbf{x}_i) = \boldsymbol{\beta}_k^{\mathsf{T}} \mathbf{x}_i$, logitboost and mart adopt the flexible "additive model," which is a function of $M$ terms:

(2)   $F^{(M)}(\mathbf{x}) = \sum_{m=1}^{M} \rho_m h(\mathbf{x}; \mathbf{a}_m),$

where $h(\mathbf{x}; \mathbf{a}_m)$, the base learner, is typically a regression tree. The parameters, $\rho_m$ and $\mathbf{a}_m$, are learned from the data by maximum likelihood, which is equivalent to minimizing the negative log-likelihood loss

(3)   $L = \sum_{i=1}^{N} L_i, \qquad L_i = -\sum_{k=0}^{K-1} r_{i,k} \log p_{i,k},$

where $r_{i,k} = 1$ if $y_i = k$ and $r_{i,k} = 0$ otherwise.
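To make the notation concrete, here is a small self-contained sketch (my own illustration, not code from the paper) of the class-probability model (1) and the negative log-likelihood loss (3):

```python
import math

def class_probabilities(F):
    """Eq. (1): p_k = exp(F_k) / sum_s exp(F_s).
    The max is subtracted before exponentiating for numerical stability."""
    m = max(F)
    exps = [math.exp(f - m) for f in F]
    z = sum(exps)
    return [e / z for e in exps]

def negative_log_likelihood(F_rows, labels):
    """Eq. (3): L = sum_i L_i with L_i = -log p_{i, y_i},
    since r_{i,k} = 1 only for k = y_i."""
    total = 0.0
    for F, y in zip(F_rows, labels):
        total -= math.log(class_probabilities(F)[y])
    return total
```

For example, when all function values are equal, every class receives probability $1/K$ and each sample contributes $\log K$ to the loss.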
1.1 Logitboost
As described in Alg. 1, [7] builds the additive model (2) by a greedy stage-wise procedure, using a second-order (diagonal) approximation, which requires knowing the first two derivatives of the loss function (3) with respect to the function values $F_{i,k}$. [7] obtained:

(4)   $\frac{\partial L_i}{\partial F_{i,k}} = -(r_{i,k} - p_{i,k}), \qquad \frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,k}(1 - p_{i,k}).$

Those derivatives can be derived by assuming no relations among the $F_{i,k}$, $k = 0$ to $K-1$. However, [7] used the "sum-to-zero" constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$ throughout the paper, and they provided an alternative explanation: [7] showed (4) by conditioning on a "base class" and noticed that the resulting derivatives are independent of the particular choice of the base class.
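The diagonal derivatives in (4) can be verified numerically. The following sketch (my own illustration, assuming the unconstrained setting where the $F_{i,k}$ are treated as free parameters) compares the closed forms against central finite differences of the loss (3):

```python
import math

def probabilities(F):
    m = max(F)
    exps = [math.exp(f - m) for f in F]
    z = sum(exps)
    return [e / z for e in exps]

def sample_loss(F, y):
    # L_i = -log p_{i, y_i}
    return -math.log(probabilities(F)[y])

def derivatives(F, y, k):
    """Eq. (4): dL_i/dF_{i,k} = -(r_{i,k} - p_{i,k}),
    d^2 L_i / dF_{i,k}^2 = p_{i,k} (1 - p_{i,k})."""
    p = probabilities(F)
    r = 1.0 if y == k else 0.0
    return -(r - p[k]), p[k] * (1.0 - p[k])

# Central-difference check at an arbitrary point.
F, y, k, h = [0.3, -0.2, 0.7], 1, 0, 1e-4
g, w = derivatives(F, y, k)
Fp, Fm = list(F), list(F)
Fp[k] += h
Fm[k] -= h
g_num = (sample_loss(Fp, y) - sample_loss(Fm, y)) / (2 * h)
w_num = (sample_loss(Fp, y) - 2 * sample_loss(F, y) + sample_loss(Fm, y)) / (h * h)
```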
At each stage, logitboost fits an individual regression function separately for each class. This is analogous to the popular individualized regression approach in multinomial logistic regression, which is known [3, 1] to result in a loss of statistical efficiency, compared with the full (conditional) maximum likelihood approach.

On the other hand, in order to use trees as the base learner, the diagonal approximation appears to be a must, at least from a practical perspective.
1.2 Adaptive Base Class Boost
[10] derived the derivatives of (3) under the sum-to-zero constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$. Without loss of generality, we can assume that class 0 is the base class. For any $k \ne 0$,

(5)   $\frac{\partial L_i}{\partial F_{i,k}} = (r_{i,0} - p_{i,0}) - (r_{i,k} - p_{i,k}), \qquad \frac{\partial^2 L_i}{\partial F_{i,k}^2} = p_{i,0}(1 - p_{i,0}) + p_{i,k}(1 - p_{i,k}) + 2 p_{i,0} p_{i,k}.$
The base class must be identified at each boosting iteration during training. [10] suggested an exhaustive procedure that adaptively finds the best base class by minimizing the training loss (3) at each iteration.
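The exhaustive strategy is easy to sketch. In the following illustration, `fit_iteration` and `training_loss` are hypothetical placeholders (standing in for fitting the $K-1$ trees of one boosting step and evaluating the loss (3)); only the outer selection loop reflects [10]:

```python
def select_base_class(num_classes, fit_iteration, training_loss):
    """Try every class as the base class and keep the candidate update
    that gives the smallest training loss."""
    best = None
    for b in range(num_classes):                 # exhaustive search
        candidate = fit_iteration(base_class=b)  # hypothetical fitting step
        loss = training_loss(candidate)          # training loss (3)
        if best is None or loss < best[1]:
            best = (b, loss, candidate)
    return best  # (chosen base class, its loss, the fitted update)
```

The adaptivity thus costs roughly $K$ times the work of one ordinary boosting iteration.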
1.3 Our Contributions
We propose abc-logitboost, by combining abc-boost with robust logitboost [11]. Our extensive experiments will demonstrate that abc-logitboost can considerably improve on logitboost and abc-mart on a variety of datasets.
2 Robust Logitboost
Our work is based on robust logitboost [11], which differs from the original logitboost algorithm. Thus, this section provides an introduction to robust logitboost.

[6, 8] commented that logitboost (Alg. 1) can be numerically unstable. The original paper [7] suggested some "crucial implementation protections" on page 17:

- In Line 5 of Alg. 1, compute the response $z_{i,k}$ by $\frac{1}{p_{i,k}}$ (if $r_{i,k} = 1$) or $\frac{-1}{1 - p_{i,k}}$ (if $r_{i,k} = 0$).

- Bound the response by $|z_{i,k}| \le z_{max} \in [2, 4]$.

Note that the above operations are applied to each individual sample. The goal is to ensure that the response is not too large (note that $|z_{i,k}| \ge 1$ always). On the other hand, we should hope to use larger $|z_{i,k}|$ to better capture the data variation. Therefore, the thresholding occurs very frequently, and it is expected that some of the useful information is lost.
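For concreteness, a one-line sketch of these protections (my illustration; the cap of 4 is just one choice within the suggested range):

```python
def clipped_response(r, p, z_max=4.0):
    """LogitBoost response z = (r - p) / (p (1 - p)) with r in {0, 1},
    which simplifies to 1/p when r = 1 and -1/(1 - p) when r = 0,
    then clipped to [-z_max, z_max] as suggested in [7]."""
    z = 1.0 / p if r == 1 else -1.0 / (1.0 - p)
    return max(-z_max, min(z_max, z))
```

For instance, a sample with $r = 1$ and $p = 0.01$ has raw response 100 but is thresholded down to 4, which is exactly the information loss discussed above.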
[11] demonstrated that, if implemented carefully, logitboost is almost identical to mart. The only difference lies in the tree-splitting criterion.
2.1 The Tree-Splitting Criterion Using the Second-Order Information
Consider $N$ weights $w_i$ and response values $z_i$, $i = 1$ to $N$, which are assumed to be ordered according to the sorted order of the corresponding feature values. The tree-splitting procedure is to find the index $s$, $1 \le s < N$, such that the weighted mean square error (MSE) is reduced the most if split at $s$. That is, we seek to maximize

$\mathrm{Gain}(s) = \sum_{i=1}^{N} w_i (z_i - \bar{z})^2 - \left[ \sum_{i=1}^{s} w_i (z_i - \bar{z}_L)^2 + \sum_{i=s+1}^{N} w_i (z_i - \bar{z}_R)^2 \right],$

where $\bar{z} = \frac{\sum_{i=1}^{N} w_i z_i}{\sum_{i=1}^{N} w_i}$, $\bar{z}_L = \frac{\sum_{i=1}^{s} w_i z_i}{\sum_{i=1}^{s} w_i}$, and $\bar{z}_R = \frac{\sum_{i=s+1}^{N} w_i z_i}{\sum_{i=s+1}^{N} w_i}$. After simplification, we obtain

$\mathrm{Gain}(s) = \frac{\left[\sum_{i=1}^{s} w_i z_i\right]^2}{\sum_{i=1}^{s} w_i} + \frac{\left[\sum_{i=s+1}^{N} w_i z_i\right]^2}{\sum_{i=s+1}^{N} w_i} - \frac{\left[\sum_{i=1}^{N} w_i z_i\right]^2}{\sum_{i=1}^{N} w_i}.$

Plugging in $w_i = p_i(1 - p_i)$ and $z_i = \frac{r_i - p_i}{p_i(1 - p_i)}$ as in Alg. 1 yields

$\mathrm{Gain}(s) = \frac{\left[\sum_{i=1}^{s} (r_i - p_i)\right]^2}{\sum_{i=1}^{s} p_i(1 - p_i)} + \frac{\left[\sum_{i=s+1}^{N} (r_i - p_i)\right]^2}{\sum_{i=s+1}^{N} p_i(1 - p_i)} - \frac{\left[\sum_{i=1}^{N} (r_i - p_i)\right]^2}{\sum_{i=1}^{N} p_i(1 - p_i)}.$

Because the computations involve $\sum p_i(1 - p_i)$ as a group, this procedure is actually numerically stable.

In comparison, mart [6] only uses the first-order information to construct the trees, i.e.,

$\mathrm{Gain}(s) = \frac{1}{s}\left[\sum_{i=1}^{s} (r_i - p_i)\right]^2 + \frac{1}{N-s}\left[\sum_{i=s+1}^{N} (r_i - p_i)\right]^2 - \frac{1}{N}\left[\sum_{i=1}^{N} (r_i - p_i)\right]^2.$
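The second-order criterion of Section 2.1 amounts to a single pass over the sorted samples, accumulating the numerators $r_i - p_i$ and the weights $p_i(1 - p_i)$ as groups. A minimal sketch (my illustration, not the paper's implementation):

```python
def best_split(g, w):
    """g[i] = r_i - p_i, w[i] = p_i (1 - p_i), both sorted by feature value.
    Returns the split position s maximizing
    Gain(s) = G_L^2 / W_L + G_R^2 / W_R - G^2 / W."""
    G, W = sum(g), sum(w)
    base = G * G / W
    best_s, best_gain = -1, 0.0
    GL = WL = 0.0
    for s in range(len(g) - 1):    # split between positions s and s + 1
        GL += g[s]
        WL += w[s]
        GR, WR = G - GL, W - WL
        gain = GL * GL / WL + GR * GR / WR - base
        if gain > best_gain:
            best_s, best_gain = s, gain
    return best_s, best_gain
```

Note that the individual responses $z_i = \frac{r_i - p_i}{p_i(1 - p_i)}$ are never formed; only the aggregated sums appear, which is why no per-sample thresholding is needed.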
2.2 The Robust Logitboost Algorithm
Alg. 2 describes robust logitboost using the tree-splitting criterion developed in Section 2.1. Note that after the trees are constructed, the values of the terminal nodes are computed by

$\beta = \frac{K-1}{K} \, \frac{\sum_{i \in \mathrm{node}} (r_i - p_i)}{\sum_{i \in \mathrm{node}} p_i (1 - p_i)},$

which explains Line 5 of Alg. 2.
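In code, the terminal-node update is a one-liner. The $\frac{K-1}{K}$ factor below is the usual multiclass logitboost scaling; treat this as my reading of the update rather than a verbatim transcription of Alg. 2:

```python
def terminal_node_value(g, w, K):
    """Node value beta = (K-1)/K * sum_i g_i / sum_i w_i over the samples
    falling into the node, with g_i = r_i - p_i and w_i = p_i (1 - p_i)."""
    return (K - 1.0) / K * sum(g) / sum(w)
```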
2.2.1 Three Main Parameters: $J$, $\nu$, and $M$
Alg. 2 has three main parameters, to which the performance is not very sensitive, as long as they fall in some reasonable range. This is a very significant advantage in practice.
The number of terminal nodes, $J$, determines the capacity of the base learner. [6] suggested $J = 6$. [7, 18] commented that $J > 10$ is unlikely. In our experience, for large datasets (or moderate datasets in high dimensions), $J = 20$ is often a reasonable choice; also see [12].

The shrinkage, $\nu$, should be large enough to make sufficient progress at each step and small enough to avoid overfitting. [6] suggested $\nu \le 0.1$. Normally, $\nu = 0.1$ is used.

The number of boosting iterations, $M$, is largely determined by the affordable computing time. A commonly regarded merit of boosting is that overfitting can be largely avoided for reasonable $J$ and $\nu$.
3 Adaptive Base Class Logitboost
The recently proposed abc-boost [10] algorithm consists of two key components:

1. Using the widely adopted sum-to-zero constraint $\sum_{k=0}^{K-1} F_{i,k} = 0$ on the loss function, one can formulate boosting algorithms for only $K-1$ classes, by treating one class as the base class.

2. At each boosting iteration, adaptively select the base class according to the training loss. [10] suggested an exhaustive search strategy.

Abc-logitboost results from combining this adaptive base class procedure with the robust logitboost of Section 2.
4 Experiments
Table 1 lists the datasets in our experiments, which include all the datasets used in [10], plus Mnist10k.^2

^2 We also did limited experiments on the original Mnist dataset (i.e., 60000 training samples and 10000 testing samples); the test misclassification error rate was about .
dataset     K    # training   # test   # features
Covertype   7    290506       290506   54
Mnist10k    10   10000        60000    784
Letter2k    26   2000         18000    16
Letter4k    26   4000         16000    16
Letter      26   16000        4000     16
Pendigits   10   7494         3498     16
Zipcode     10   7291         2007     256
Optdigits   10   3823         1797     64
Isolet      26   6218         1559     617
Note that Zipcode, Optdigits, and Isolet are very small datasets (especially their testing sets); they may not necessarily provide a reliable comparison of different algorithms. Since they are popular datasets, we nevertheless include them in our experiments.
Recall logitboost has three main parameters: $J$, $\nu$, and $M$. Since overfitting is largely avoided, we simply let $M$ be very large (smaller only for Covertype), unless machine zero is reached. The performance is not sensitive to $J$ as long as it lies in a reasonable range, and it is also not too sensitive to $\nu$ in a good range.

Ideally, we would like to show that, for every reasonable combination of $J$ and $\nu$ (using $M$ as large as possible), abc-logitboost exhibits consistent improvement over (robust) logitboost. For most datasets, we experimented with every combination of $J \in \{4, 6, 8, 10, 12, 14, 16, 18, 20\}$ and $\nu \in \{0.04, 0.06, 0.08, 0.1\}$.
We provide a summary of the experiments after presenting the detailed results on Mnist10k.
4.1 Experiments on the Mnist10k Dataset
For this dataset, we experimented with every combination of $J \in \{4, 6, 8, 10, 12, 14, 16, 18, 20\}$ and $\nu \in \{0.04, 0.06, 0.08, 0.1\}$. We trained until the loss (3) reached machine zero, to exhaust the capacity of the learner so that we could provide a reliable comparison, up to $M = 10000$ iterations.
Figures 1 and 2 present the test misclassification errors for every $J$, $\nu$, and $M$:

- Essentially no overfitting is observed, especially for abc-logitboost. This is why we simply report the smallest test error in Table 2.

- The performance is not sensitive to $\nu$.

- The performance is not very sensitive to $J$, for $J = 4$ to 20.
Interestingly, abc-logitboost sometimes needed more iterations to reach machine zero than logitboost. This can be explained in part by the fact that the "$\nu$" in logitboost is not precisely the same as the "$\nu$" in abc-logitboost [10]. This is also why we would like to experiment with a range of $\nu$ values.

Table 2 summarizes the smallest test misclassification errors along with the relative improvements (denoted by R) of abc-logitboost over logitboost. For most $J$ and $\nu$, abc-logitboost exhibits roughly 10% to 16% smaller test misclassification errors than logitboost. The corresponding P-values are not reported in Table 2.
Each cell: logitboost, abc-logitboost, relative improvement R (%).

 J    ν = 0.04           ν = 0.06           ν = 0.08           ν = 0.1
 4    2911 2623  9.9     2884 2597 10.0     2876 2530 12.0     2878 2485 13.7
 6    2658 2255 15.2     2644 2240 15.3     2625 2224 15.3     2626 2212 15.8
 8    2536 2157 14.9     2541 2122 16.5     2521 2117 16.0     2533 2134 15.8
 10   2486 2118 14.8     2472 2111 14.6     2447 2083 14.9     2446 2095 14.4
 12   2435 2082 14.5     2424 2086 13.9     2420 2086 13.8     2426 2090 13.9
 14   2399 2083 13.2     2407 2081 13.5     2402 2056 14.4     2400 2048 14.7
 16   2421 2098 13.3     2405 2114 12.1     2382 2083 12.6     2364 2079 12.1
 18   2397 2086 13.0     2397 2079 13.3     2386 2080 12.8     2357 2085 11.5
 20   2384 2124 10.9     2409 2109 14.5     2404 2095 12.9     2372 2101 11.4
The original abc-boost paper [10] did not include experiments on Mnist10k. Thus, in this study, Table 3 summarizes the smallest test misclassification errors for mart and abc-mart. Again, we can see very consistent and considerable improvements of abc-mart over mart. Also, comparing Tables 2 and 3, we can see that abc-logitboost also significantly improves over abc-mart.
Each cell: mart, abc-mart, relative improvement R (%).

 J    ν = 0.04           ν = 0.06           ν = 0.08           ν = 0.1
 4    3346 3054  8.7     3308 3009  9.0     3302 2855 13.5     3287 2792 15.1
 6    3176 2752 13.4     3074 2624 14.6     3071 2649 13.7     3089 2572 16.7
 8    3040 2557 15.9     3012 2552 15.2     3000 2529 15.7     2993 2566 14.3
 10   2979 2537 14.8     2941 2515 14.5     2957 2509 15.2     2947 2493 15.4
 12   2912 2498 14.2     2897 2453 15.3     2906 2475 14.8     2887 2469 14.5
 14   2907 2473 14.9     2886 2466 14.6     2874 2463 14.3     2864 2435 15.0
 16   2885 2466 14.5     2879 2441 15.2     2868 2459 14.2     2854 2451 14.1
 18   2852 2467 13.5     2860 2447 14.4     2865 2436 15.0     2852 2448 14.2
 20   2831 2438 13.9     2833 2440 13.9     2832 2425 14.4     2813 2434 13.5
4.2 Summary of Test Misclassification Errors
Table 4 summarizes the overall best (smallest) test misclassification errors. In the table, R (%) is the relative improvement of test performance. The P-values test the statistical significance of whether abc-logitboost achieved smaller error rates than logitboost.

To compare abc-logitboost with abc-mart, Table 4 also includes the test errors for abc-mart and the P-values (i.e., P-value (2)) for testing whether abc-logitboost achieved smaller error rates than abc-mart. The comparisons indicate that there is a clear performance gap between abc-logitboost and abc-mart, especially on the large datasets.
Dataset     logit   abc-logit   R (%)   P-value   abc-mart   P-value (2)
Covertype   10759   9693        9.9     -         10375      -
Mnist10k    2357    2048        13.1    -         2425       -
Letter2k    2257    1984        12.1    -         2180       -
Letter4k    1220    1031        15.5    -         1126       0.017
Letter      107     89          16.8    -         99         0.23
Pendigits   109     90          17.4    -         100        0.23
Zipcode     103     92          10.7    0.21      100        0.28
Optdigits   49      38          22.5    0.11      43         0.29
Isolet      62      55          11.3    0.25      64         0.20
4.3 Experiments on the Covertype Dataset
Table 5 summarizes the smallest test misclassification errors of logitboost and abc-logitboost, along with the relative improvements (R). Since this is a fairly large dataset, we only experimented with $\nu = 0.1$, $J \in \{10, 20\}$, and $M \le 5000$.
 ν     M      J    logit   abc-logit (R %)
 0.1   1000   10   29865   23774 (20.4)
 0.1   1000   20   19443   14443 (25.7)
 0.1   2000   10   21620   16991 (21.4)
 0.1   2000   20   13914   11336 (18.5)
 0.1   3000   10   17805   14295 (19.7)
 0.1   3000   20   12076   10399 (13.9)
 0.1   5000   10   14698   12185 (17.1)
 0.1   5000   20   10759   9693  (9.9)
The results on Covertype are reported differently from the other datasets. Covertype is fairly large, so building a very large model (i.e., many boosting steps) would be expensive, and testing a very large model at run-time can be costly or infeasible for certain applications (e.g., search engines). Therefore, it is often important to examine the performance of the algorithm at much earlier boosting iterations. Table 5 shows that abc-logitboost may improve logitboost by as much as 25.7%, as opposed to the 9.9% reported in Table 4.
4.4 Experiments on the Letter2k Dataset
Each cell: logitboost, abc-logitboost, relative improvement R (%).

 J    ν = 0.04           ν = 0.06           ν = 0.08           ν = 0.1
 4    2576 2317 10.1     2535 2294  9.5     2545 2252 11.5     2523 2224 11.9
 6    2389 2133 10.7     2391 2111 11.7     2376 2070 12.9     2370 2064 12.9
 8    2325 2074 10.8     2299 2046 11.0     2298 2033 11.5     2271 2025 10.8
 10   2294 2041 11.0     2292 1995 13.0     2279 2018 11.5     2276 2000 12.1
 12   2314 2010 13.1     2304 1990 13.6     2311 2010 13.0     2268 2018 11.0
 14   2315 2015 13.0     2300 2003 12.9     2312 2003 13.4     2277 2024 11.1
 16   2302 2022 12.2     2394 1996 13.0     2276 3002 12.0     2257 1997 11.5
 18   2295 2041 11.1     2275 2021 11.2     2301 1984 13.8     2281 2020 11.4
 20   2280 2047 10.2     2267 2020 10.9     2294 2020 11.9     2306 2031 11.9
4.5 Experiments on the Letter4k Dataset
Each cell: logitboost, abc-logitboost, relative improvement R (%).

 J    ν = 0.04           ν = 0.06           ν = 0.08           ν = 0.1
 4    1460 1295 11.3     1471 1232 16.2     1452 1199 17.4     1446 1204 16.7
 6    1390 1135 18.3     1394 1116 20.0     1382 1088 21.3     1374 1070 22.1
 8    1336 1078 19.3     1332 1074 19.4     1311 1062 19.0     1297 1042 20.0
 10   1289 1051 18.5     1285 1065 17.1     1280 1031 19.5     1273 1046 17.8
 12   1251 1055 15.7     1247 1065 14.6     1261 1044 17.2     1243 1051 15.4
 14   1247 1060 15.0     1233 1050 14.8     1251 1037 17.1     1244 1060 14.8
 16   1244 1070 14.0     1227 1064 13.3     1231 1044 15.2     1228 1038 15.5
 18   1243 1057 15.0     1250 1037 17.0     1234 1049 15.0     1220 1055 13.5
 20   1226 1078 12.0     1242 1069 13.9     1242 1054 15.1     1235 1051 14.9
4.6 Experiments on the Letter Dataset
Each cell: logitboost, abc-logitboost, relative improvement R (%).

 J    ν = 0.04         ν = 0.06         ν = 0.08         ν = 0.1
 4    149 125 16.1     151 121 19.9     148 122 17.6     149 119 20.1
 6    130 112 13.8     132 107 18.9     133 101 24.1     129 102 20.9
 8    129 104 19.4     125 102 18.4     131  93 29.0     113  95 15.9
 10   114 101 11.4     115 100 13.0     123  96 22.0     117  93 20.5
 12   112  96 14.3     115 100 13.0     107  95 11.2     112  95 15.2
 14   110  96 12.7     113  98 13.3     113  94 16.8     110  89 19.1
 16   111  97 12.6     113  94 16.8     109  93 14.7     109  95 12.8
 18   114  95 16.7     112  92 17.9     111  96 13.5     117  93 20.5
 20   113  95 15.9     113  97 14.2     115  93 19.1     113  89 21.2
4.7 Experiments on the Pendigits Dataset
Each cell: logitboost, abc-logitboost, relative improvement R (%).

 J    ν = 0.04         ν = 0.06         ν = 0.08         ν = 0.1
 4    119  92 22.7     120  93 22.5     118  90 23.7     119  92 22.7
 6    111  98 11.7     111  97 12.6     111  96 13.5     107  93 13.1
 8    116  97 16.4     117  94 19.7     115  95 17.4     114  93 18.4
 10   116 100 13.8     115  98 14.8     116  97 16.4     111  97 12.6
 12   117  98 16.2     113  98 13.2     113  98 13.3     114  98 14.0
 14   113 100 11.5     115 101 12.2     112  99 11.6     114  98 14.0
 16   112 100 10.7     118  97 18.8     112  98 12.5     113  96 15.0
 18   114 102 10.5     112  97 13.4     109  99  9.2     112  97 13.4
 20   112 106  5.4     116 102 12.1     113 100 11.5     107 100  6.5
4.8 Experiments on the Zipcode Dataset
Each cell: logitboost, abc-logitboost, relative improvement R (%).

 J    ν = 0.04         ν = 0.06         ν = 0.08         ν = 0.1
 4    114 111  2.6     117 108  7.6     111 114 -2.7     115 107  7.0
 6    109 101  7.3     107 102  4.6     106  98  7.5     110  99 10.0
 8    110  99 10.0     108  95 12.0     108  96 11.1     108  98  9.3
 10   111  97 12.6     110  94 14.5     106  97  8.5     103  94  8.7
 12   111  98 11.7     112  98 12.5     111  99 10.8     108  93 13.9
 14   112 100 10.7     108  99  8.3     110  97 11.8     114  92 19.3
 16   111  98 11.7     114  95 16.7     110  99 10.0     111  98 11.7
 18   112  96 14.2     114  98 14.0     109 101  7.3     113  98 13.3
 20   114  97 14.9     108  96 11.1     109 100  8.3     116  96 17.2
4.9 Experiments on the Optdigits Dataset
Each cell: logitboost, abc-logitboost, relative improvement R (%).

 J    ν = 0.04        ν = 0.06        ν = 0.08        ν = 0.1
 4    52 41 21.2      50 42 16.0      50 40 20.0      49 41 16.3
 6    52 43 17.3      52 45 13.5      53 44 17.0      52 38 26.9
 8    55 44 20.0      55 44 20.0      53 45 15.1      54 45 16.7
 10   57 50 12.3      56 50 10.7      54 46 14.8      55 42 23.6
 12   52 50  3.8      55 48 12.7      54 47 13.0      54 46 14.8
 14   58 48 17.2      55 46 16.4      56 51  8.9      53 48  9.4
 16   61 54 11.5      57 51 10.5      58 49 15.5      56 46 17.9
 18   65 54 16.9      64 55 14.0      60 53 11.7      66 51 22.7
 20   63 61  3.2      61 56  8.2      64 55 14.1      64 55 14.1
4.10 Experiments on the Isolet Dataset
For this dataset, [10] only experimented with $\nu = 0.1$ for mart and abc-mart. We add the experiment results for $\nu = 0.06$.
Each cell: logitboost, abc-logitboost, relative improvement R (%).

 J    ν = 0.06        ν = 0.1
 4    65 55 15.4      62 55 11.3
 6    67 59 11.9      69 58 15.9
 8    72 57 20.8      72 60 16.7
 10   73 61 16.4      75 62 17.3
 12   75 63 16.0      75 64 14.7
 14   74 65 12.2      75 60 20.0
 16   70 64  8.6      71 62 12.7
 18   74 67  9.5      73 62 15.1
 20   71 63 11.3      73 65 11.0
Each cell: mart, abc-mart, relative improvement R (%).

 J    ν = 0.06        ν = 0.1
 4    81 68 16.1      80 64 20.0
 6    86 71 17.4      84 67 20.2
 8    86 72 16.3      84 72 14.3
 10   87 74 14.9      82 74  9.8
 12   93 73 21.5      91 74 18.7
 14   92 73 20.7      95 74 22.1
 16   91 73 19.8      94 78 17.0
 18   86 75 12.8      86 78  9.3
 20   95 79 16.8      87 78 10.3
5 Conclusion
Multiclass classification is a fundamental task in machine learning. This paper presents the abc-logitboost algorithm and demonstrates its considerable improvements over logitboost and abc-mart on a variety of datasets.
There is one interesting UCI dataset named Poker, with 25K training samples and 1 million testing samples. Our experiments showed that abc-boost could achieve a very high accuracy on it. Interestingly, a considerably lower accuracy was obtained using LibSVM.^3 We will report the results in a separate paper.

^3 Chih-Jen Lin. Private communications in May 2009 and August 2009.
References
 [1] Alan Agresti. Categorical Data Analysis. John Wiley & Sons, Inc., Hoboken, NJ, second edition, 2002.
 [2] Peter Bartlett, Yoav Freund, Wee Sun Lee, and Robert E. Schapire. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.
 [3] Colin B. Begg and Robert Gray. Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1):11–18, 1984.
 [4] Yoav Freund. Boosting a weak learning algorithm by majority. Inf. Comput., 121(2):256–285, 1995.
 [5] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119–139, 1997.

 [6] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232, 2001.
 [7] Jerome H. Friedman, Trevor J. Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
 [8] Jerome H. Friedman, Trevor J. Hastie, and Robert Tibshirani. Response to evidence contrary to the statistical view of boosting. Journal of Machine Learning Research, 9:175–180, 2008.

 [9] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67–81, 2004.
 [10] Ping Li. ABC-boost: Adaptive base class boost for multi-class classification. In ICML, Montreal, Canada, 2009.
 [11] Ping Li. Robust logitboost. Technical report, Department of Statistical Science, Cornell University, 2009.
 [12] Ping Li, Christopher J.C. Burges, and Qiang Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, Vancouver, BC, Canada, 2008.
 [13] Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. Boosting algorithms as gradient descent. In NIPS, 2000.
 [14] Robert Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
 [15] Robert E. Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
 [16] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007–1025, 2007.
 [17] Tong Zhang. Statistical analysis of some multicategory large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004.
 [18] Hui Zou, Ji Zhu, and Trevor Hastie. New multicategory boosting algorithms based on multicategory Fisher-consistent losses. The Annals of Applied Statistics, 2(4):1290–1306, 2008.