Figure 1: Full range and mean value of the number of accepted full research papers for top 20 affiliations at KDD between 2011 and 2015
Second, every good predictive model needs a trustworthy baseline. In the first phase of the competition we aim to build a solid baseline model. For this, we compute the probability that a full research paper belongs to each affiliation, based on the affiliation's number of accepted papers over the past five years. We then rank the affiliations by these probabilities; this ranking becomes the baseline against which we compare all our other models.
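As a rough illustration, the baseline can be sketched in a few lines of Python. The input format (one `(year, affiliation)` pair per accepted paper), the function name and the toy affiliations "A", "B" and "C" are assumptions made up for this example, not the exact code we used.

```python
from collections import Counter


def baseline_ranking(accepted_papers):
    """Rank affiliations by their share of accepted papers.

    accepted_papers: iterable of (year, affiliation) pairs, one pair per
    accepted full research paper (hypothetical input format).
    """
    counts = Counter(affiliation for _, affiliation in accepted_papers)
    total = sum(counts.values())
    # Probability that a randomly picked accepted paper belongs to an affiliation.
    probabilities = {aff: n / total for aff, n in counts.items()}
    # Highest probability first; ties are broken alphabetically.
    ranking = sorted(probabilities, key=lambda aff: (-probabilities[aff], aff))
    return ranking, probabilities


# Toy data covering five past years (not real KDD numbers).
papers = [
    (2011, "A"), (2012, "A"), (2013, "B"),
    (2014, "A"), (2015, "B"), (2015, "C"),
]
ranking, probabilities = baseline_ranking(papers)
```

Here `ranking` orders the affiliations from most to least likely, and that order is what every later model has to beat.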
In the second and final phases of the competition, we experiment with two classes of models: mixed models and gradient boosted decision trees (GBDT). The former is more interpretable while the latter has more predictive power. In every model we try, the prediction target is the relevance of each affiliation. The relevance is, if you remember, the sum of the fractional contributions by all the authors of an affiliation for a conference in a year. You can now imagine our dataset as a large matrix where the columns are our features (I’ll explain which features we use in just a bit) and the rows are the observations. Basically, each row holds the feature values for one conference, one affiliation and one year.
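To make the target concrete, here is a minimal sketch of a relevance computation for one conference and one year. It assumes that every paper counts as one unit, split equally among its authors, and that an author's share is split equally among their affiliations. The text above only says the relevance is a sum of fractional contributions, so the exact splitting rule and the data layout are assumptions of this example.

```python
from collections import defaultdict


def relevance_scores(papers):
    """Sum fractional author contributions per affiliation.

    papers: list of papers; each paper is a list of authors; each author is
    a list of affiliation names (hypothetical layout).
    """
    scores = defaultdict(float)
    for authors in papers:
        author_share = 1.0 / len(authors)  # equal split among authors (assumed)
        for affiliations in authors:
            for aff in affiliations:
                # Equal split among an author's affiliations (assumed).
                scores[aff] += author_share / len(affiliations)
    return dict(scores)


papers = [
    [["A"], ["B"]],       # two authors: A and B get 0.5 each
    [["A"], ["A", "C"]],  # A gets 0.5 + 0.25, C gets 0.25
]
scores = relevance_scores(papers)
```

Each value in `scores` would be the target of one row in the matrix described above.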
More data usually helps, so the next thing we try is to increase the dataset size by using information from more years and from conferences related to each of the conferences we want to make predictions for. What does ‘related’ mean? Most researchers publish their work at several conferences. However, they specialize in a specific area, so the conferences they publish at have to be similar in at least a few respects. We use authors and keywords from the papers in MAG to cluster similar conferences together. It is a straightforward way to grow the dataset even more. The intuition is that the information from related conferences will reinforce the patterns discovered by the models, because prolific affiliations are prolific across all the conferences they submit to, not just at one of them. We compute the Jaccard similarity for both authors and keywords for any pair of conferences in the MAG. From this, we can determine which conferences are, for example, most similar to KDD in terms of common authors and common paper keywords. We experimented with different numbers of related conferences, which allows us to expand our training dataset immensely and greatly improve our predictions.
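The Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below applies it to author sets only; the conference names "ConfX", "ConfY" and "ConfZ", the author names and the helper `most_related` are invented for illustration, and the same function would work on keyword sets. How the author and keyword similarities are combined is not shown here.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_related(target, conference_items, top_n):
    """Return the top_n conferences most similar to `target`."""
    scores = {
        name: jaccard(conference_items[target], items)
        for name, items in conference_items.items()
        if name != target
    }
    # Highest similarity first; ties are broken alphabetically.
    return sorted(scores, key=lambda name: (-scores[name], name))[:top_n]


# Toy author sets per conference (made up, not taken from MAG).
authors = {
    "KDD": {"ann", "bob", "cat", "dan"},
    "ConfX": {"ann", "bob", "cat", "eve"},
    "ConfY": {"ann", "zed"},
    "ConfZ": {"yan"},
}
related = most_related("KDD", authors, top_n=2)
```

The `top_n` argument plays the role of the number of related conferences, which is one of the knobs we vary.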
Ok, with the dataset in place comes the fun part: feature engineering. We experimented with many features, but will only mention here the ones which worked best for us. We created features meant to capture each affiliation’s long and short-term relevance trends:
- Stats of all previous relevance scores (std, sum, mean, median, min, max)
- Previous relevance scores computed in windows from previous year up to 4 years ago
- Stats of previous relevance scores (std, sum, mean, median, min, max) computed in windows from previous year up to 4 years ago
- Drift trend of previous relevance scores
- Exponential weighted moving average of previous relevance scores with estimated smoothing parameter
- Exponential weighted moving average of previous relevance scores, computed with a fixed smoothing parameter
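Two of these feature families, the windowed stats and the exponential weighted moving average with a fixed smoothing parameter, can be sketched as follows. The function names, the feature naming scheme, the value `alpha=0.5` and the use of the population standard deviation are assumptions of this example. The drift trend and the estimated smoothing parameter are left out.

```python
import statistics


def window_stats(history, window):
    """Summary statistics over the last `window` relevance scores."""
    recent = history[-window:]
    return {
        "sum": sum(recent),
        "mean": statistics.mean(recent),
        "median": statistics.median(recent),
        "min": min(recent),
        "max": max(recent),
        # Population std, so a one-value window gives 0.0 instead of an error.
        "std": statistics.pstdev(recent),
    }


def ewma(history, alpha):
    """Exponentially weighted moving average; `history` is oldest first."""
    smoothed = history[0]
    for score in history[1:]:
        smoothed = alpha * score + (1 - alpha) * smoothed
    return smoothed


# Relevance scores of one affiliation at one conference, oldest first (toy values).
history = [2.0, 4.0, 4.0, 6.0]

features = {}
for window in range(1, 5):  # previous year up to 4 years ago
    for name, value in window_stats(history, window).items():
        features[f"{name}_last_{window}"] = value
features["ewma_fixed"] = ewma(history, alpha=0.5)
```

Both helpers expect a non-empty `history`; affiliations without any past scores would need separate handling.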
Dataset+Features+Baseline+Tuning=Profit
The final step in any competition is to polish your predictions through tuning. In the final phase of the competition the organizers chose 3 well-known conferences for validation: FSE, MM and MOBICOM. For each of the conferences, we search for the feature configuration for which the GBDT model gives the best predictions, by performing a grid search over different combinations of features and numbers of related conferences. As a result, the training dataset was different for each of the 3 conferences. Although the final feature sets differed between the conferences, some features do well across all of them: the exponential smoothing features improved the final predictions for every conference.
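The shape of such a grid search can be sketched as below. The `evaluate` callback stands in for "train the GBDT on this configuration and score it on the validation conference"; here it is replaced by a toy function with invented numbers, so the feature group names and scores are purely illustrative and are not our results.

```python
from itertools import combinations


def grid_search(feature_groups, related_counts, evaluate):
    """Try every non-empty feature combination with every related-conference count."""
    best_score, best_config = float("-inf"), None
    for size in range(1, len(feature_groups) + 1):
        for groups in combinations(feature_groups, size):
            for n_related in related_counts:
                score = evaluate(groups, n_related)
                if score > best_score:
                    best_score, best_config = score, (groups, n_related)
    return best_config, best_score


def toy_evaluate(groups, n_related):
    # Placeholder scoring, NOT a real model: a hypothetical stand-in for
    # training a GBDT and measuring its validation score.
    bonus = {"ewma": 0.3, "window_stats": 0.2, "drift": -0.1}
    return sum(bonus[g] for g in groups) - 0.01 * abs(n_related - 5)


best_config, best_score = grid_search(
    ["ewma", "window_stats", "drift"], [0, 5, 10], toy_evaluate
)
```

Running the search once per validation conference gives one best configuration per conference. The search is exhaustive, so the number of model fits grows quickly with the number of feature groups.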
Figure 2 shows the results of the best feature configuration for each conference. We used the tuning process to choose the best feature set for each of the conferences, such that all the scores of the GBDT model are above the probabilities model baseline.
Figure 2: Results for the best configuration of the engineered features
Conclusions and tips for ML competitions
This was my first shot at the KDD Cup competition and it couldn’t have ended better. I believe our systematic way of building the models, coupled with some careful feature engineering, was what in the end set us apart from the other teams. So here are some pointers for you to keep in mind. You have probably read all of them before but they cannot be stressed enough.
P.S.1. The approach is fully described in the paper Predicting the future relevance of research institutions - The winning solution of the KDD Cup 2016. Parts of this article were shamelessly taken from the paper. Some things were only superficially mentioned in this article. So I encourage you to read the entire paper for a full overview of the solution.
P.S.2. This entire article was also published on the Adform Engineering Blog