São Paulo's Legislative Discourses Sentiment Analysis
Download Success: 100%
Transcription Success: 99.4%
Processing Success: 98.7%
Overview
An end-to-end Natural Language Processing (NLP) pipeline for analyzing legislative speeches from São Paulo City Council plenary sessions.
The project automates the collection, transcription, preprocessing, sentiment analysis, topic extraction, and visualization of parliamentary debates.
Features
- Automated audio extraction from YouTube sessions
- Speech-to-text transcription using Faster Whisper
- Portuguese text preprocessing with spaCy
- Sentiment classification using Transformer models
- Topic modeling with BERTopic
- Interactive dashboard with Streamlit
- Data persistence using DuckDB and Parquet
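To make the transcription feature more concrete, here is a minimal sketch of how Faster Whisper output could be turned into flat records ready for storage. It is an illustration, not the project's actual code: the helper names `segments_to_rows` and `transcribe`, the record fields, the `"small"` model size, and the CPU/int8 settings are all assumptions.

```python
from types import SimpleNamespace


def segments_to_rows(segments, session_id):
    """Convert Whisper-style segments (start, end, text) into flat records."""
    rows = []
    for seg in segments:
        text = seg.text.strip()
        if not text:  # skip segments that contain only whitespace
            continue
        rows.append(
            {
                "session_id": session_id,
                "start_s": round(seg.start, 2),
                "end_s": round(seg.end, 2),
                "text": text,
            }
        )
    return rows


def transcribe(audio_path, session_id, model_size="small"):
    """Transcribe a Portuguese audio file (hypothetical wrapper)."""
    # Imported lazily so segments_to_rows works without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Assumed settings: CPU inference with int8 quantization.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, language="pt")
    return segments_to_rows(segments, session_id)


# Demonstration with stand-in segments, so no audio file or model is needed.
fake_segments = [
    SimpleNamespace(start=0.0, end=2.5, text=" Bom dia a todos. "),
    SimpleNamespace(start=2.5, end=3.0, text="   "),
]
print(segments_to_rows(fake_segments, "sessao-001"))
```

Keeping the conversion separate from the model call means the record format can be checked without downloading a model, and the resulting list of dictionaries can be loaded into a DataFrame or written to Parquet in a later step.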
Pipeline Overview
Audio Acquisition → Transcription → Preprocessing → Sentiment Analysis → Topic Modeling → Dashboard
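The stage order above could be driven by a small runner script. The sketch below is hypothetical (the repository may not include such a file) and assumes the stage scripts named in the Getting Started commands are run from the repository root; `STAGES`, `build_commands`, and `run_pipeline` are illustrative names.

```python
import subprocess
import sys
from pathlib import Path

# Stage scripts in pipeline order, using the script names from Getting Started.
STAGES = [
    ("Audio Acquisition", "src/download_audio.py"),
    ("Transcription", "src/transcribe.py"),
    ("Preprocessing", "src/preprocess.py"),
    ("Sentiment Analysis", "src/sentiment.py"),
    ("Topic Modeling", "src/topics.py"),
]


def build_commands(python=sys.executable):
    """Return one command (as an argument list) per stage, in execution order."""
    return [[python, script] for _, script in STAGES]


def run_pipeline(repo_root=".", dry_run=False):
    """Run each stage in order, stopping at the first failure."""
    for (name, _), cmd in zip(STAGES, build_commands()):
        print(f"[{name}] {' '.join(cmd)}")
        if not dry_run:
            # check=True raises CalledProcessError if a stage exits non-zero.
            subprocess.run(cmd, cwd=Path(repo_root), check=True)


if __name__ == "__main__":
    # Dry run by default: only prints the plan without executing anything.
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the stages for real. The dashboard is left out because it is a long-running Streamlit process, launched separately with `streamlit run src/dashboard.py`.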
Getting Started
- Clone the repository:
    git clone https://github.com/cintia-shinoda/legislative-nlp-pipeline.git
    cd legislative-nlp-pipeline
- Create and activate a virtual environment:
    python -m venv venv
    source venv/bin/activate   # macOS/Linux
    venv\Scripts\activate      # Windows
- Install dependencies:
    pip install -r requirements.txt
- Download the spaCy Portuguese model:
    python -m spacy download pt_core_news_lg
- Run the pipeline stages in order:
    python src/download_audio.py     # download audio
    python src/transcribe.py         # transcribe sessions
    python src/preprocess.py         # preprocess text
    python src/sentiment.py          # perform sentiment analysis
    python src/topics.py             # extract topics
    streamlit run src/dashboard.py   # launch dashboard
Architecture
┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│     yt-dlp     │    │ faster-whisper │    │     spaCy      │
│   (download)   │    │(transcription) │    │   (cleaning)   │
│                │───▶│                │───▶│                │
│  URL → .wav    │    │  .wav → text   │    │  raw text      │
│                │    │                │    │  → clean       │
└────────────────┘    └────────────────┘    └────────────────┘
                                                    │
                                                    ▼
┌────────────────┐    ┌────────────────┐    ┌────────────────┐
│   Streamlit    │◀───│    BERTopic    │◀───│   BERTimbau    │
│  (dashboard)   │    │    (topics)    │    │  (sentiment)   │
│                │    │                │    │                │
│  data →        │    │  texts →       │    │  clean text    │
│  charts        │    │  clusters      │    │  → score       │
└────────────────┘    └────────────────┘    └────────────────┘
Future Work
Potential future improvements include:
- Speaker identification
- Named entity analytics
- Legislative bill tracking
- Political alignment analysis
- RAG-based question answering
- Longitudinal topic evolution
- Real-time session monitoring