BERTweet.BR: A Pre-trained Language Model for Portuguese Tweets


We are excited to share our paper "BERTweet.BR: a pre-trained language model for tweets in Portuguese", now published in Neural Computing and Applications.

Check out the 📄 paper and the 🤗 model on Hugging Face.

While most advances in neural language models focus on English, Portuguese, despite being the sixth most spoken language in the world, still lacks domain-specific large-scale resources. This gap is even more evident for social media, where Brazilian users are among the most active globally.

To address this, we introduce BERTweet.BR, the first large-scale pre-trained language model specifically designed for Brazilian Portuguese tweets. The model:

  • 🧠 Follows the BERTweet architecture (BERT-based)
  • 📚 Was trained from scratch using the RoBERTa pre-training procedure
  • 🐦 Uses a corpus of 100 million Portuguese tweets
  • 📊 Outperforms multilingual Transformers and BERTimbau on sentiment analysis

Tweets present unique challenges: informal language, cultural references, code-switching, abbreviations, and character limits. BERTweet.BR is designed to better capture these characteristics and support downstream tasks in social media analysis for Portuguese.
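Noisy elements such as user mentions and URLs are typically normalized before tweets are fed to models like this one; the original BERTweet, for example, maps mentions to `@USER` and URLs to `HTTPURL`. A minimal sketch of that style of preprocessing (assuming BERTweet.BR follows similar conventions; the function name `normalize_tweet` is illustrative, not part of the released code):

```python
import re

def normalize_tweet(text: str) -> str:
    """Sketch of BERTweet-style tweet normalization.

    Assumption: BERTweet.BR uses placeholders similar to the original
    BERTweet (@USER for mentions, HTTPURL for links); check the
    project's GitHub repository for the exact pipeline.
    """
    text = re.sub(r"@\w+", "@USER", text)            # user mentions -> @USER
    text = re.sub(r"https?://\S+", "HTTPURL", text)  # links -> HTTPURL
    return " ".join(text.split())                    # collapse extra whitespace

print(normalize_tweet("@joao olha isso https://t.co/abc kkkk"))
# -> @USER olha isso HTTPURL kkkk
```

Keeping placeholders consistent with the ones used during pre-training matters: the tokenizer treats them as known tokens, so downstream fine-tuning sees the same vocabulary the model was trained on.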

The model is publicly available in the 🤗 Transformers library, and the code, documentation, and experimental results are open on GitHub.

We hope BERTweet.BR fosters new research in Portuguese NLP and strengthens analytical tools for social media in Brazil.