news

Feb 12, 2026 We are going to LREC :sparkles: :smile:! We have two papers accepted at LREC 2026: one about Q&A for ESG, in partnership with ICT, and the other, in collaboration with LALIC/UFSCar, about Portuguese metaphors.
Feb 06, 2026 We are going to PROPOR :sparkles: :smile:! We have five papers accepted at PROPOR 2026, on topics including auditing automatic metrics for evaluating Portuguese financial commentaries, a new PT-BR text simplification dataset for the legal domain, evaluating PT-BR health-related reasoning and RAG in Q&A, a new corpus of Brazilian election transcriptions, and normalization of old PT-PT texts with LLMs.
Feb 04, 2026 We are happy to share that our paper “Select First, Transfer Later: Choosing a Proper Dataset for SRL and GNN Based Transfer Learning” has been published in the Machine Learning journal. Traditional machine learning models often ignore the relational structure present in many real-world domains. Approaches such as Statistical Relational Learning (SRL) and Graph Neural Networks (GNNs) address this by explicitly modeling dependencies between entities. However, like traditional models, they typically assume that training and testing data come from the same distribution, an assumption that often fails in practice. Transfer Learning helps by reusing knowledge from one domain in another. But an important question remains largely overlooked: 👉 From where should we transfer? In this work, we propose a principled method to estimate the suitability of transfer between relational domains using Kullback–Leibler (KL) divergence, computed between a Naive Bayes distribution of the target relational data and each candidate source model (SRL or GNN). 🔎 Main contributions:
  • A strategy to select the most appropriate source domain before performing transfer.
  • Empirical evaluation with state-of-the-art SRL and GNN transfer learning algorithms.
  • Experimental evidence that selecting the right source significantly improves performance.
Our results reinforce that, in relational and graph-based learning, choosing the source domain is just as important as deciding what and how to transfer.
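To make the selection idea concrete, here is a minimal, self-contained sketch under simplifying assumptions: categorical features, a Laplace-smoothed categorical (Naive Bayes-style) estimate of the target distribution, and candidate sources represented by precomputed distributions. The helper names and the toy domains are hypothetical illustrations, not the paper's actual implementation.

```python
import math
from collections import Counter

def categorical_dist(values, vocab, alpha=1.0):
    """Laplace-smoothed categorical distribution over a fixed vocabulary."""
    counts = Counter(values)
    total = len(values) + alpha * len(vocab)
    return {v: (counts[v] + alpha) / total for v in vocab}

def kl_divergence(p, q):
    """KL(p || q) for two distributions sharing the same support."""
    return sum(p[v] * math.log(p[v] / q[v]) for v in p)

def rank_sources(target_values, source_dists, vocab):
    """Rank candidate source domains by KL(target || source), lowest first."""
    p = categorical_dist(target_values, vocab)
    scores = {name: kl_divergence(p, q) for name, q in source_dists.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])

# Toy example: two hypothetical source domains over one binary relation feature.
vocab = ["linked", "not_linked"]
target = ["linked"] * 30 + ["not_linked"] * 70
sources = {
    "citations": categorical_dist(["linked"] * 35 + ["not_linked"] * 65, vocab),
    "movies":    categorical_dist(["linked"] * 80 + ["not_linked"] * 20, vocab),
}
ranking = rank_sources(target, sources, vocab)
best_source = ranking[0][0]  # the domain whose distribution is closest to the target
```

Here "citations" wins because its link distribution is far closer to the target's than the "movies" one; in the paper, the same comparison is made against distributions derived from each candidate SRL or GNN source model rather than toy counts.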
Jan 10, 2026 I will be one of the PC chairs of the 19th Symposium in Information and Human Language Technology (STIL). The submission deadline is March 30th, via JEMS. Check out the conference website for more information. STIL will be co-located with BRACIS, KDMiLE, and WESACC.
Jan 03, 2026 We are thrilled to share that Vítor Lourenço is heading to Rabat, Morocco 🇲🇦 in March, as our paper has been accepted to EACL 2026 ✨ 👉 “KG-CRAFT: Knowledge Graph-based Contrastive Reasoning with LLMs for Enhancing Automated Fact-checking”. The paper will be presented at the EACL main conference, which makes this achievement even more special. 📄 In KG-CRAFT, we propose a novel approach to automated fact-checking that combines knowledge graphs with contrastive reasoning in large language models. By constructing a claim-centric knowledge graph and generating contrastive questions, the method guides LLMs toward better evidence selection and reasoning, leading to more reliable and accurate claim verification. The approach achieves state-of-the-art results on real-world fact-checking benchmarks.
Oct 08, 2025 🦜 Exploring Brazil’s LLM Fauna: Evaluating Generative Performance in Portuguese We are excited to share our new paper, “Exploring Brazil’s LLM Fauna: Investigating the Generative Performance of Large Language Models in Portuguese.” As Large Language Models (LLMs) become increasingly embedded in real-world applications, their evaluation still relies heavily on narrow, mostly English-centered benchmarks. These traditional evaluations often neglect essential generative aspects such as discourse coherence, adequacy, and linguistic transformations — all crucial for practical use. In this work, we provide a comprehensive evaluation of Brazilian Portuguese LLMs across three core Natural Language Generation tasks:
  • 📝 Text Summarization
  • ✂️ Sentence Simplification
  • ❓ Generative Question Answering
We evaluate six Brazilian models and compare them with GPT-4o, combining automatic metrics, an LLM-as-a-judge framework, and human evaluation. 🔎 Key findings:
  • GPT-4o achieves the strongest overall generative performance in Portuguese.
  • The Sabiá-3 family follows closely behind.
  • The open-weight model Tucano stands out for its computational efficiency, making it a strong candidate for deployment in resource-constrained environments.
All experimental code is publicly available: 👉 https://github.com/MeLLL-UFF/brfauna-gen-eval This work contributes to a broader understanding of how LLMs perform beyond English and supports more realistic, generation-focused evaluation pipelines for Portuguese NLP.
Apr 03, 2025 We are going to ACL 2025 :sparkles: :smile:! Our long paper “Evaluating LLMs for Portuguese Sentence Simplification with Linguistic Insights” has been accepted to the main conference of ACL 2025, and we are very excited to share this work with the community. This publication represents an important milestone in our ongoing research efforts and contributes to international discussions in Natural Language Processing for Portuguese. In this study, we evaluated 26 advanced AI language models to see how well they simplify sentences in Portuguese. We also compared them with two models specifically trained for this task, one of which we proposed in our ACL 2024 paper. To do this, we used three different collections of texts, including a new dataset we created with real sentences from Brazilian government agencies. We analyzed the results using automatic measures and detailed linguistic evaluation, and also manually reviewed some examples. Our main finding is that although open-source AI models perform very well, commercial (closed-source) models still achieve the best results when simplifying Portuguese texts. However, all of them still show a gap in linguistic quality when compared against human evaluation. We look forward to presenting the paper at ACL and engaging with the community around its ideas and findings!
Dec 24, 2024 🚀 New Publication: BERTweet.BR — A Pre-trained Language Model for Portuguese Tweets We are excited to share our paper “BERTweet.BR: a pre-trained language model for tweets in Portuguese”, now published in Neural Computing and Applications. Check out the 📄 paper and the
🤗 model on Hugging Face. While most advances in neural language models focus on English, Portuguese, despite being the sixth most spoken language in the world, still lacks domain-specific large-scale resources. This gap is even more evident on social media, where Brazilian users are among the most active globally. To address this, we introduce BERTweet.BR, the first large-scale pre-trained language model specifically designed for Brazilian Portuguese tweets. The model:
  • 🧠 Follows the BERTweet architecture (BERT-based)
  • 📚 Was trained from scratch using the RoBERTa pre-training procedure
  • 🐦 Uses a corpus of 100 million Portuguese tweets
  • 📊 Outperforms multilingual Transformers and BERTimbau on sentiment analysis
Tweets present unique challenges — informal language, cultural references, code-switching, abbreviations, and character limits. BERTweet.BR is designed to better capture these characteristics and support downstream tasks in social media analysis for Portuguese. The model is publicly available in the 🤗 Transformers library, and the code, documentation, and experimental results are open on GitHub. We hope BERTweet.BR fosters new research in Portuguese NLP and strengthens analytical tools for social media in Brazil.
May 15, 2024 I spoke in the IBERAMIA pre-event series about some simple, feasible, and practical steps to enhance the Brazilian AI ecosystem. The talk is now available on the IBERAMIA channel.