São Paulo's Legislative Discourses Sentiment Analysis

Python Natural Language Processing (NLP)

Download Success: 100%
Transcription Success: 99.4%
Processing Success: 98.7%

Overview

An end-to-end Natural Language Processing (NLP) pipeline for analyzing legislative speeches from São Paulo City Council plenary sessions.

The project automates the collection, transcription, preprocessing, sentiment analysis, topic extraction, and visualization of parliamentary debates.

Features

Automated audio extraction from YouTube sessions
Speech-to-text transcription using Faster Whisper
Portuguese text preprocessing with spaCy
Sentiment classification using Transformer models
Topic modeling with BERTopic
Interactive dashboard with Streamlit
Data persistence using DuckDB and Parquet
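
As a rough illustration of the audio extraction feature, the sketch below builds a yt-dlp command line that saves a session's audio track as WAV. The helper name `build_download_command`, the `data/audio` output directory, and the file-naming template are assumptions made for this example, not necessarily what `src/download_audio.py` does.

```python
import shlex


def build_download_command(url: str, out_dir: str = "data/audio") -> list[str]:
    """Build a yt-dlp command that extracts a session's audio track as WAV."""
    return [
        "yt-dlp",
        "--extract-audio",                         # keep only the audio stream
        "--audio-format", "wav",                   # convert to WAV (needs ffmpeg)
        "--output", f"{out_dir}/%(id)s.%(ext)s",   # assumed naming: one file per video ID
        url,
    ]


if __name__ == "__main__":
    # Placeholder URL; substitute a real plenary session link.
    cmd = build_download_command("https://www.youtube.com/watch?v=VIDEO_ID")
    print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues with the output template.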

Pipeline Overview

Audio Acquisition → Transcription → Preprocessing → Sentiment Analysis → Topic Modeling → Dashboard
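
The linear flow above can be pictured as a chain of functions in which each stage consumes the previous stage's output. The following is a minimal sketch of that idea only; `run_stages`, `make_stub`, and the stand-in stages are hypothetical and do not reflect how the repository's scripts are actually wired together.

```python
from functools import reduce
from typing import Any, Callable

Stage = Callable[[Any], Any]

# Stage names taken from the pipeline overview, in execution order.
STAGE_NAMES = [
    "Audio Acquisition",
    "Transcription",
    "Preprocessing",
    "Sentiment Analysis",
    "Topic Modeling",
    "Dashboard",
]


def run_stages(stages: list[Stage], initial: Any) -> Any:
    """Feed the output of each stage into the next, left to right."""
    return reduce(lambda data, stage: stage(data), stages, initial)


def make_stub(name: str) -> Stage:
    """Return a stand-in stage that only records its own name."""
    return lambda trail: trail + [name]


if __name__ == "__main__":
    trail = run_stages([make_stub(name) for name in STAGE_NAMES], [])
    print(" → ".join(trail))
```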

Getting Started

Clone the repository:

git clone https://github.com/cintia-shinoda/legislative-nlp-pipeline.git

cd legislative-nlp-pipeline

Create and activate a virtual environment:

python -m venv venv

source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows

Install dependencies:

pip install -r requirements.txt

Download the spaCy Portuguese model:

python -m spacy download pt_core_news_lg

Run each pipeline stage in order:

python src/download_audio.py     # download audio
python src/transcribe.py         # transcribe sessions
python src/preprocess.py         # preprocess text
python src/sentiment.py          # perform sentiment analysis
python src/topics.py             # extract topics
streamlit run src/dashboard.py   # launch dashboard
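
For convenience, the batch stages could be chained from a single script that stops as soon as one of them fails. This is a hedged sketch: `run_pipeline` and its `runner` parameter are hypothetical names introduced here, and the script paths are simply the ones listed above. The dashboard is left out because it is a long-running server.

```python
import subprocess
import sys

# Batch stages in the order listed above; the dashboard is launched separately.
STAGE_SCRIPTS = [
    "src/download_audio.py",
    "src/transcribe.py",
    "src/preprocess.py",
    "src/sentiment.py",
    "src/topics.py",
]


def run_pipeline(scripts=STAGE_SCRIPTS, runner=subprocess.run) -> list[str]:
    """Run each script with the current interpreter; stop at the first failure."""
    completed = []
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            raise RuntimeError(f"{script} exited with code {result.returncode}")
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the repository root, with the virtual environment active, would run the five scripts in sequence. The `runner` argument exists only so the function can be exercised without launching real subprocesses.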

Architecture

        ┌─────────────┐    ┌──────────────┐    ┌──────────────┐
        │   yt-dlp    │    │faster-whisper│    │    spaCy     │
        │  (download) │    │ (transcribe) │    │  (cleaning)  │
        │             │───▶│              │───▶│              │
        │ URL → .wav  │    │ .wav → text  │    │   raw text   │
        │             │    │              │    │  → clean     │
        └─────────────┘    └──────────────┘    └──────────────┘
                                                       │
                                                       │
                                                       ▼
        ┌─────────────┐    ┌──────────────┐    ┌──────────────┐
        │  Streamlit  │◀───│   BERTopic   │◀───│  BERTimbau   │
        │ (dashboard) │    │   (topics)   │    │ (sentiment)  │
        │             │    │              │    │              │
        │   data →    │    │   texts →    │    │  clean text  │
        │   charts    │    │  clusters    │    │  → score     │
        └─────────────┘    └──────────────┘    └──────────────┘
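
To make the second box of the diagram (.wav to text) more concrete, here is a minimal sketch of a transcription step built on faster-whisper. The model size (`small`), CPU device, `int8` compute type, and the helper names `transcribe` and `segments_to_text` are assumptions for illustration; the project's `src/transcribe.py` may use different settings.

```python
from typing import Iterable


def segments_to_text(segments: Iterable) -> str:
    """Join the non-empty text of transcription segments into one string."""
    return " ".join(seg.text.strip() for seg in segments if seg.text.strip())


def transcribe(wav_path: str, model_size: str = "small") -> str:
    """Transcribe a Portuguese audio file with faster-whisper."""
    # Imported lazily so segments_to_text works without the dependency installed.
    from faster_whisper import WhisperModel

    # Assumed settings: CPU inference with int8 quantization.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # transcribe() returns a lazy generator of segments plus run metadata.
    segments, _info = model.transcribe(wav_path, language="pt")
    return segments_to_text(segments)
```

Because the segments come back as a generator, the actual decoding happens while `segments_to_text` iterates over them, so long sessions are processed incrementally rather than all at once.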

Future Work

Potential improvements include:

Speaker identification
Named entity analytics
Legislative bill tracking
Political alignment analysis
RAG-based question answering
Longitudinal topic evolution
Real-time session monitoring
