
São Paulo's Legislative Discourses Sentiment Analysis

Python · Natural Language Processing (NLP)

Download success: 100% · Transcription success: 99.4% · Processing success: 98.7%

Overview

An end-to-end Natural Language Processing (NLP) pipeline for analyzing legislative speeches from São Paulo City Council plenary sessions.

The project automates the collection, transcription, preprocessing, sentiment analysis, topic extraction, and visualization of parliamentary debates.

  • Automated audio extraction from YouTube sessions

  • Speech-to-text transcription using Faster Whisper

  • Portuguese text preprocessing with spaCy

  • Sentiment classification using Transformer models

  • Topic modeling with BERTopic

  • Interactive dashboard with Streamlit

  • Data persistence using DuckDB and Parquet
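As an illustration of the first item in the list above, here is a minimal sketch of audio extraction through yt-dlp's Python API. It is not the project's actual `src/download_audio.py`; the function names, the `data/audio` output directory, and the choice of WAV via the ffmpeg post-processor are assumptions made for the example.

```python
from pathlib import Path


def build_ydl_options(output_dir: str) -> dict:
    """Build yt-dlp options that keep only the audio track and convert it to WAV."""
    return {
        "format": "bestaudio/best",  # prefer an audio-only stream when one exists
        "outtmpl": str(Path(output_dir) / "%(id)s.%(ext)s"),  # name files by video id
        "postprocessors": [
            # Conversion to WAV is delegated to ffmpeg, which must be installed.
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"}
        ],
        "quiet": True,
    }


def download_audio(url: str, output_dir: str = "data/audio") -> None:
    """Download the audio of one session (hypothetical helper, assumed layout)."""
    # Imported lazily so the options can be inspected without yt-dlp installed.
    from yt_dlp import YoutubeDL

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    with YoutubeDL(build_ydl_options(output_dir)) as ydl:
        ydl.download([url])
```

Calling `download_audio(url)` with the URL of a session video would leave a `.wav` file named after the video id in the output directory, assuming ffmpeg is available on the system.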

Audio Acquisition → Transcription → Preprocessing → Sentiment Analysis → Topic Modeling → Dashboard
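The transcription stage in the flow above could be sketched as follows with Faster Whisper. This is an assumption-laden example rather than the project's `src/transcribe.py`: the `small` model size, CPU inference with `int8` quantization, and the row layout are illustrative choices.

```python
from typing import Iterable


def segments_to_rows(segments: Iterable) -> list[dict]:
    """Convert transcription segments into plain dicts that are easy to store."""
    return [
        {"start": round(seg.start, 2), "end": round(seg.end, 2), "text": seg.text.strip()}
        for seg in segments
    ]


def transcribe(audio_path: str, model_size: str = "small") -> list[dict]:
    """Transcribe one Portuguese audio file (hypothetical helper)."""
    # Imported lazily so segments_to_rows can be used without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Model size, device and compute type are assumptions; adjust to your hardware.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() yields segments lazily; each has .start, .end and .text.
    segments, _info = model.transcribe(audio_path, language="pt")
    return segments_to_rows(segments)
```

Keeping the timestamps alongside the text makes it possible to link a sentiment score or topic back to a moment in the session.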

  1. Clone the repository:
git clone https://github.com/cintia-shinoda/legislative-nlp-pipeline.git
cd legislative-nlp-pipeline
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # macOS/Linux
venv\Scripts\activate # Windows
  3. Install dependencies:
pip install -r requirements.txt
  4. Download the spaCy Portuguese model:
python -m spacy download pt_core_news_lg
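Once the `pt_core_news_lg` model is downloaded, Portuguese text can be cleaned with spaCy roughly as shown below. This is a sketch under assumptions, not the project's `src/preprocess.py`: the specific filters (stop words, punctuation, whitespace) and the lowercase-lemma output are illustrative.

```python
from typing import Iterable


def keep_lemmas(tokens: Iterable) -> list[str]:
    """Return lowercase lemmas, skipping stop words, punctuation and whitespace."""
    return [
        tok.lemma_.lower()
        for tok in tokens
        if not (tok.is_stop or tok.is_punct or tok.is_space)
    ]


def load_model():
    """Load the Portuguese model once; loading is slow, so reuse the result."""
    import spacy  # imported lazily so keep_lemmas works without spaCy installed

    return spacy.load("pt_core_news_lg")


def preprocess(nlp, text: str) -> list[str]:
    """Run the spaCy pipeline on raw text and keep only the informative lemmas."""
    return keep_lemmas(nlp(text))
```

A typical use would be `nlp = load_model()` followed by `preprocess(nlp, transcript)` for each transcript.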
Run the pipeline stages in order:
python src/download_audio.py # download audio
python src/transcribe.py # transcribe sessions
python src/preprocess.py # preprocess text
python src/sentiment.py # perform sentiment analysis
python src/topics.py # extract topics
streamlit run src/dashboard.py # launch dashboard
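The batch stages listed above can also be chained from a small runner script. The sketch below is an assumption (no such script is described in the project); it reuses the script paths from the commands above and stops at the first stage that fails. The dashboard is left out because `streamlit run` starts a long-running server.

```python
import subprocess
import sys

# Batch stages in execution order, taken from the commands above.
STAGES = [
    "src/download_audio.py",
    "src/transcribe.py",
    "src/preprocess.py",
    "src/sentiment.py",
    "src/topics.py",
]


def stage_command(script: str) -> list[str]:
    """Build the command for one stage, using the current Python interpreter."""
    return [sys.executable, script]


def run_pipeline(stages=STAGES) -> None:
    """Run each stage in order; a failing stage aborts the remaining ones."""
    for script in stages:
        print(f"Running {script} ...")
        # check=True raises CalledProcessError when a stage exits with a non-zero code.
        subprocess.run(stage_command(script), check=True)
```

Calling `run_pipeline()` from the repository root, inside the activated virtual environment, would run the five stages back to back.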
┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│     yt-dlp     │    │ faster-whisper │    │     spaCy      │
│   (download)   │    │(transcription) │    │   (cleaning)   │
│                │───▶│                │───▶│                │
│   URL → .wav   │    │  .wav → text   │    │   raw text →   │
│                │    │                │    │   clean text   │
└────────────────┘    └────────────────┘    └────────────────┘
┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│   Streamlit    │◀───│    BERTopic    │◀───│   BERTimbau    │
│  (dashboard)   │    │    (topics)    │    │  (sentiment)   │
│     data →     │    │    texts →     │    │  clean text →  │
│     charts     │    │    clusters    │    │     score      │
└────────────────┘    └────────────────┘    └────────────────┘
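The sentiment step in the diagram (clean text in, score out) might look like the sketch below, which uses the Hugging Face `transformers` text-classification pipeline. The checkpoint is deliberately left as a parameter: the diagram names BERTimbau but does not say which fine-tuned sentiment checkpoint is used, so `model_name` must be supplied by the caller. Splitting by word count is also an assumption, used here as a rough way to keep long speeches under the model's input limit.

```python
def chunk_words(text: str, max_words: int = 200) -> list[str]:
    """Split a long text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def classify(text: str, model_name: str) -> list[dict]:
    """Classify each chunk of a speech (hypothetical helper).

    model_name must point to a checkpoint fine-tuned for sentiment
    classification; the base BERTimbau weights alone have no sentiment head.
    """
    # Imported lazily so chunk_words can be used without transformers installed.
    from transformers import pipeline

    chunks = chunk_words(text)
    if not chunks:
        return []
    classifier = pipeline("text-classification", model=model_name)
    # truncation=True guards against chunks that still exceed the token limit.
    return classifier(chunks, truncation=True)
```

Each returned item is a dict with a `label` and a `score`; how the chunk-level results are combined into one value per speech is a separate design decision not covered here.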

Potential future improvements include:

  • Speaker identification
  • Named entity analytics
  • Legislative bill tracking
  • Political alignment analysis
  • RAG-based question answering
  • Longitudinal topic evolution
  • Real-time session monitoring