🎉 New Publication: BERTweet.BR, a Pre-trained Language Model for Portuguese Tweets
We are excited to share our paper "BERTweet.BR: a pre-trained language model for tweets in Portuguese", now published in Neural Computing and Applications.
Check out the 📄 paper and the 🤗 model on Hugging Face.
While most advances in neural language models focus on English, Portuguese, despite being the sixth most spoken language in the world, still lacks large-scale, domain-specific resources. This gap is even more evident for social media, where Brazilian users are among the most active globally.
To address this, we introduce BERTweet.BR, the first large-scale pre-trained language model specifically designed for Brazilian Portuguese tweets. The model:
- 🧠 Follows the BERTweet architecture (BERT-based)
- 🚀 Was trained from scratch using the RoBERTa pre-training procedure (sketched below)
- 📦 Uses a corpus of 100 million Portuguese tweets
- 🏆 Outperforms multilingual Transformers and BERTimbau on sentiment analysis
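For readers less familiar with the RoBERTa recipe, the sketch below shows roughly what this kind of pre-training setup looks like with the Hugging Face Transformers and Datasets libraries: a masked language model with dynamic masking and no next-sentence objective. The tokenizer path, data file, model size, and training arguments are illustrative placeholders, not the settings used in the paper.

```python
# Illustrative RoBERTa-style masked-LM pre-training loop.
# Paths, config sizes, and hyperparameters are placeholders, not the
# values reported in the BERTweet.BR paper.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# A tokenizer previously trained on the tweet corpus, and a text file
# with one normalized tweet per line (both hypothetical).
tokenizer = RobertaTokenizerFast.from_pretrained("path/to/tweet-tokenizer")
tweets = load_dataset("text", data_files={"train": "tweets.txt"})["train"]
tweets = tweets.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

# Dynamic masking: 15% of tokens are masked at batch-collation time,
# so each epoch sees a fresh masking pattern, as in RoBERTa.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bertweetbr-mlm", per_device_train_batch_size=32),
    train_dataset=tweets,
    data_collator=collator,
)
trainer.train()
```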
Tweets present unique challenges: informal language, cultural references, code-switching, abbreviations, and character limits. BERTweet.BR is designed to better capture these characteristics and support downstream tasks in social media analysis for Portuguese.
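As a concrete illustration of the preprocessing such a model relies on, here is a minimal normalization sketch in the spirit of the original BERTweet pipeline, which maps user mentions and URLs to placeholder tokens (@USER and HTTPURL); the exact normalization rules used for BERTweet.BR are described in the paper.

```python
import re

def normalize_tweet(text: str) -> str:
    """Minimal tweet normalization in the style of BERTweet.

    Mentions and URLs are mapped to placeholder tokens so the model
    sees a bounded vocabulary; the actual BERTweet.BR rules may differ.
    """
    text = re.sub(r"https?://\S+|www\.\S+", "HTTPURL", text)  # mask URLs
    text = re.sub(r"@\w+", "@USER", text)                     # mask user mentions
    return " ".join(text.split())                             # collapse whitespace

print(normalize_tweet("@joao olha isso https://t.co/abc kkkk"))
# -> "@USER olha isso HTTPURL kkkk"
```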
The model is publicly available in the 🤗 Transformers library, and the code, documentation, and experimental results are open on GitHub.
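As a minimal usage sketch, assuming the Hub checkpoint is named melll-uff/bertweetbr (please confirm the identifier on the model card), the model can be loaded for masked-token prediction like this:

```python
# Minimal sketch: loading BERTweet.BR for masked-token prediction.
# The checkpoint id "melll-uff/bertweetbr" is assumed; confirm it on
# the Hugging Face model card before running.
from transformers import pipeline

fill = pipeline("fill-mask", model="melll-uff/bertweetbr")
for pred in fill("Que dia <mask> hoje!"):  # RoBERTa-style <mask> token
    print(pred["token_str"], round(pred["score"], 3))
```

Fine-tuning for downstream tasks such as sentiment analysis then follows the standard Transformers sequence-classification recipe.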
We hope BERTweet.BR fosters new research in Portuguese NLP and strengthens analytical tools for social media in Brazil.