Resources


Here you can find our machine learning models as well as the datasets used in our published work.


Datasets

Aptoide Mobile Applications Dataset

  1. Global Application Information (80MB zip, 441MB uncompressed JSON)
    Contains the most relevant attributes for each app
  2. Individual Application Meta Data (609MB zip, 2.6GB uncompressed JSON)
    Contains all the metadata
  3. Individual Application Category (5MB zip, 126MB uncompressed JSON)
    Contains the association between Apps and Categories

If you use this dataset (or part of it), please cite the paper:

João Coelho, António Neto, Miguel Tavares, Carlos Coutinho, Ricardo Ribeiro and Fernando Batista (2021) Semantic Search of Mobile Applications Using Word Embeddings. In 10th Symposium on Languages, Applications and Technologies (SLATE 2021). Dagstuhl, Germany. (Queirós, Ricardo, Pinto, Mário, Simões, Alberto, Portela, Filipe and Pereira, Maria João, Eds.) Schloss Dagstuhl — Leibniz-Zentrum für Informatik, pages 12:1-12:12. (paper)

Additional resources

The following resources will be published after the acceptance of our paper, recently submitted to Engineering Applications of Artificial Intelligence, and now under revision.

Subset of the Aptoide Mobile Applications Dataset – A smaller subset of approximately 6000 applications, selected according to the number of days since the last update, the number of downloads, and the ratings, followed by a final manual validation. It was created because the vast majority of applications in the full dataset were rarely updated, rarely downloaded, or had low ratings or very few ratings, and were therefore of little relevance for a recommendation system.
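As a rough illustration of the automatic part of this selection, the sketch below filters applications by update recency, downloads, and ratings. The field names (last_update, downloads, rating, num_ratings) and all threshold values are hypothetical placeholders, not the ones used to build the subset, and the final manual validation step is not modelled.

```python
from datetime import date

# Hypothetical thresholds, chosen only for illustration.
MAX_DAYS_SINCE_UPDATE = 365
MIN_DOWNLOADS = 10_000
MIN_RATING = 3.5
MIN_NUM_RATINGS = 50


def is_relevant(app, today):
    """Return True if an app passes every (hypothetical) relevance filter."""
    days_since_update = (today - app["last_update"]).days
    return (
        days_since_update <= MAX_DAYS_SINCE_UPDATE
        and app["downloads"] >= MIN_DOWNLOADS
        and app["rating"] >= MIN_RATING
        and app["num_ratings"] >= MIN_NUM_RATINGS
    )


# Toy records with assumed field names; the real JSON schema may differ.
apps = [
    {"name": "fresh-popular", "last_update": date(2021, 3, 1),
     "downloads": 250_000, "rating": 4.4, "num_ratings": 1_200},
    {"name": "stale", "last_update": date(2017, 1, 15),
     "downloads": 900_000, "rating": 4.1, "num_ratings": 5_000},
    {"name": "few-ratings", "last_update": date(2021, 5, 20),
     "downloads": 20_000, "rating": 4.9, "num_ratings": 3},
]

today = date(2021, 6, 1)
candidates = [app["name"] for app in apps if is_relevant(app, today)]
print(candidates)
```

Applications that pass such filters would still be candidates only; the subset described above was additionally validated by hand.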

Aptoide User Information Dataset – Contains information concerning the installed applications for 1,034,104 users. It comprises a single-day snapshot of active users and their installed applications. Each user is identified by a hashed identifier, complying with Aptoide’s data protection policy. Applications are also represented by an identifier, which is consistent with the one used in the Aptoide Mobile Applications Dataset.

RoBertapp model – Previous studies showed that out-of-the-box BERT-like embeddings are unsuitable for semantic-similarity tasks. We therefore fine-tuned the RoBERTa (base) model on two tasks. First, masked language modelling was applied to the names and descriptions of approximately 400,000 applications with English text. Then, the model was trained on a semantic similarity task, drawing on the same dataset used for the masked language modelling objective, namely application names, descriptions, and categories. For this sort of training, a large set of queries labeled with relevant applications would be useful, but no such labels are available in the dataset. Instead, the semantic training task consisted of distinguishing between real and fake descriptions for a given synthesized query, which is obtained by concatenating an application's name with its category. For instance, for the application instagram, the query instagram social is compared during training to the real description and to randomly sampled fake descriptions. The standard binary cross-entropy loss was used to enforce high similarity between positive pairs and low similarity between negative ones.
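The construction of training pairs for the semantic similarity task can be sketched as follows. This is a minimal, library-free illustration rather than the actual training code: the helper names (make_query, build_pairs, binary_cross_entropy), the number of negatives per query, the lowercasing of the query, and the toy application records are all assumptions. In real training, the similarity scores would come from the fine-tuned encoder; here the loss is simply evaluated on hand-picked scores in (0, 1).

```python
import math
import random


def make_query(name, category):
    # Synthesized query: application name concatenated with its category.
    # Lowercasing is an assumption made to match the "instagram social" example.
    return f"{name} {category}".lower()


def build_pairs(apps, num_negatives=2, seed=0):
    """Build (query, description, label) triples.

    Label 1 marks the application's real description; label 0 marks a
    description randomly sampled from another application (a "fake").
    """
    rng = random.Random(seed)
    pairs = []
    for i, app in enumerate(apps):
        query = make_query(app["name"], app["category"])
        pairs.append((query, app["description"], 1))
        others = [a["description"] for j, a in enumerate(apps) if j != i]
        for fake in rng.sample(others, min(num_negatives, len(others))):
            pairs.append((query, fake, 0))
    return pairs


def binary_cross_entropy(scores, labels, eps=1e-12):
    """Mean binary cross-entropy over similarity scores in (0, 1)."""
    total = 0.0
    for score, label in zip(scores, labels):
        score = min(max(score, eps), 1.0 - eps)  # avoid log(0)
        total -= label * math.log(score) + (1 - label) * math.log(1.0 - score)
    return total / len(scores)


# Toy records; the descriptions are invented for illustration.
apps = [
    {"name": "Instagram", "category": "Social",
     "description": "Share photos and videos with friends."},
    {"name": "Chess Master", "category": "Games",
     "description": "Play chess against the computer."},
    {"name": "QuickNotes", "category": "Productivity",
     "description": "Take notes and make to-do lists."},
]

pairs = build_pairs(apps, num_negatives=2)
for query, description, label in pairs[:3]:
    print(label, "|", query, "|", description)

# The loss is low when positives score high and negatives score low.
good = binary_cross_entropy([0.9, 0.1], [1, 0])
bad = binary_cross_entropy([0.1, 0.9], [1, 0])
print(good < bad)
```

Sampling fakes at random keeps the procedure independent of any labeled query set, which is the constraint the paragraph above describes.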