A Practical Approach to Novel Class Discovery in Tabular Data: Appendix

:::info
Authors:
(1) Troisemaine Colin, Department of Computer Science, IMT Atlantique, Brest, France, and Orange Labs, Lannion, France;
(2) Reiffers-Masson Alexandre, Department of Computer Science, IMT Atlantique, Brest, France;
(3) Gosselin Stephane, Orange Labs, Lannion, France;
(4) Lemaire Vincent, Orange Labs, Lannion, France;
(5) Vaton Sandrine, Department of Computer Science, IMT Atlantique, Brest, France.
:::
Table of Links
Abstract and Intro
Related work
Approaches
Hyperparameter optimization
Estimating the number of novel classes
Full training procedure
Experiments
Conclusion
Declarations
References
Appendix A: Additional result metrics
Appendix B: Hyperparameters
Appendix C: Cluster Validity Indices numerical results
Appendix D: NCD k-means centroids convergence study
Appendix A Additional result metrics
Appendix B Hyperparameters
Table B3 lists the hyperparameters found by the full procedure described in Section 6.
Appendix C Cluster Validity Indices numerical results
An estimate of the number of clusters in the 7 datasets considered in this paper can be found in Table C4. Among the 6 CVIs reported here, the Silhouette coefficient performed the best. Furthermore, compared to the original feature space, its average estimation error decreased significantly in the latent space, validating our approach. For some datasets, the Davies-Bouldin index continued to decrease and the Dunn index continued to increase as the number of clusters grew, resulting in very large overestimations. Note that the estimates of the number of novel classes in Table C4 are not needed in the experiments of Section 7.2.2, since Algorithm 1 directly incorporates such estimates into the training procedure; this table only helped us identify the most appropriate CVI for our problem. The only exception is the TabularNCD method, which requires an a priori estimate of the number of novel classes in the original feature space.
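As an illustration, selecting the number of clusters with the Silhouette coefficient can be sketched as below. This is a minimal sketch assuming scikit-learn; the function name and the candidate range are ours, not from the paper.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def estimate_n_clusters(X, k_min=2, k_max=10, seed=0):
    """Return the k in [k_min, k_max] that maximizes the Silhouette coefficient.

    Hypothetical helper: runs plain k-means for each candidate k and scores
    the resulting partition with the Silhouette CVI.
    """
    best_k, best_score = k_min, -1.0
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

The same loop can be scored with `davies_bouldin_score` (lower is better) or the Dunn index, which is how the CVIs in Table C4 would be compared on equal footing.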
Appendix D NCD k-means centroids convergence study
In this appendix, we aim to determine how to achieve the best performance with NCD k-means. Specifically, after the centroid initialization described in Section 3.2, we investigate: (1) whether it is more effective to update the centroids of both known and novel classes, or only the centroids of novel classes; (2) whether the centroids need to be updated using data from both known and novel classes, or only using data from novel classes. The results are presented in Table D5 and show that for 5 out of 7 datasets, the best results are obtained when only the centroids of the novel classes are updated on the unlabeled data. Updating the centroids of the known classes always leads to worse performance, as the class labels are not used in this process. Thus, the centroids of the known classes run the risk of capturing data from the novel classes (and vice versa).
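The best-performing setting (freeze the known-class centroids and update only the novel-class centroids, using only the unlabeled data) can be sketched as follows. This is an illustrative NumPy re-implementation under our own assumptions, not the authors' exact code.

```python
import numpy as np


def ncd_kmeans(X_unlab, known_centroids, novel_centroids, n_iters=50):
    """Sketch of NCD k-means after centroid initialization.

    Points are assigned to the nearest centroid among ALL classes, but only
    the novel-class centroids are moved, and only from unlabeled data, so the
    known-class centroids cannot drift onto novel-class data (or vice versa).
    """
    known = np.asarray(known_centroids, dtype=float)
    novel = np.asarray(novel_centroids, dtype=float).copy()
    assign = np.zeros(len(X_unlab), dtype=int)
    for _ in range(n_iters):
        centroids = np.vstack([known, novel])
        # Assign each unlabeled point to its nearest centroid (known or novel).
        dists = np.linalg.norm(X_unlab[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update only the novel centroids, from their assigned unlabeled points.
        for j in range(len(novel)):
            mask = assign == len(known) + j
            if mask.any():
                novel[j] = X_unlab[mask].mean(axis=0)
    return novel, assign
```

Allowing the known centroids to move as well would amount to replacing the `novel[j] = ...` update with one over all centroids, which is the variant that Table D5 shows performs worse.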
:::info
This paper is available on arxiv under CC 4.0 license.
:::