Developed by Wenlan (Tony) Xie, this repository constitutes the official replication package for the research paper:
Tian, H., Xie, W. (Tony), & Zhang, Y. (2026). "Reading Between the Reels: An AI-Driven Approach to Analysing Movie Review Sentiment and Market Returns." *International Journal of Finance & Economics*. [https://doi.org/10.1002/ijfe.70129](https://doi.org/10.1002/ijfe.70129)
This project implements a production-grade asynchronous ETL (Extract, Transform, Load) pipeline designed to quantify investor attention distractions using unstructured textual data. By leveraging a large language model (GPT-4o) with strict schema validation, I processed over 247,000 IMDb movie reviews (2000-2024) to construct a high-frequency sentiment index, empirically testing the "Attention Distraction Hypothesis" in financial markets.
This repository demonstrates the integration of Software Engineering best practices into Financial Economics research, prioritizing reproducibility, scalability, and data integrity.
- **Modular Architecture**: The extraction logic is decoupled into distinct modules for URL discovery (`01_fetch_urls.py`), metadata extraction (`02_extract_metadata.py`), and review mining (`03_collect_reviews.py`), ensuring separation of concerns.
- **Resilience & Idempotency**: Implements state-aware execution logic. The pipeline automatically detects existing progress in `data/raw/` to prevent redundant scraping and enable seamless resumption after interruptions.
- **Production-Grade Stability**: Utilizes `tenacity` for exponential-backoff retry strategies and `httpx[http2]` for high-performance, asynchronous-ready network requests, significantly reducing failure rates compared to traditional synchronous scrapers.
- **High-Throughput Inference**: Integrates OpenAI GPT-4o via `AsyncOpenAI`. By leveraging Python's `asyncio` and `Semaphore`, the pipeline achieves a 20x speedup in processing thousands of reviews compared to sequential execution.
- **Structured Data Enforcement**: Uses Pydantic models to strictly enforce output schemas (e.g., sentiment score $\in [1, 10]$). This eliminates parsing errors common in unstructured text analysis and ensures type safety across the data pipeline.
- **Prompt Engineering**: Employs a rigorous system prompt designed to minimize hallucination and standardize sentiment scoring across diverse review lengths and writing styles.
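The schema-enforcement idea can be sketched with a minimal Pydantic model. The `ReviewSentiment` class and its field names are illustrative assumptions, not the paper's exact schema:

```python
# Minimal sketch of Pydantic-enforced LLM output (Pydantic v2 API).
# The schema below is a hypothetical stand-in for the real one.
from pydantic import BaseModel, Field, ValidationError


class ReviewSentiment(BaseModel):
    # Scores are constrained to the closed interval [1, 10]
    sentiment_score: int = Field(ge=1, le=10)
    rationale: str


# A well-formed model response parses cleanly into typed fields
raw = '{"sentiment_score": 8, "rationale": "Enthusiastic praise for the cast."}'
parsed = ReviewSentiment.model_validate_json(raw)

# An out-of-range score is rejected before it can enter the dataset
try:
    ReviewSentiment.model_validate_json('{"sentiment_score": 42, "rationale": "x"}')
except ValidationError:
    pass
```

Validating at the parsing boundary means malformed or out-of-range LLM outputs fail loudly instead of silently corrupting the sentiment index.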
- **Configuration as Code**: All scraping parameters (headers, timeouts) and file paths are centralized in `config/settings.yaml`, decoupling configuration from business logic.
- **Centralized Logging**: Implements a robust `logging` system (via `src/utils/logger.py`) that captures detailed execution traces to both console and persistent log files for auditability.
- **Defensive Programming**: Includes comprehensive type hinting (`typing`), thorough docstrings, and robust error handling for edge cases in unstructured web data (e.g., malformed HTML, missing metadata).
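The `Semaphore`-bounded fan-out pattern behind the throughput claims above can be sketched as follows. Here `process_review` is a purely illustrative stand-in for the real GPT-4o call:

```python
# Sketch of bounded-concurrency fan-out with asyncio.Semaphore.
# process_review is a hypothetical placeholder for the awaited API call.
import asyncio

CONCURRENCY_LIMIT = 20  # cap on simultaneous in-flight requests


async def process_review(review: str, sem: asyncio.Semaphore) -> str:
    async with sem:            # at most CONCURRENCY_LIMIT tasks inside at once
        await asyncio.sleep(0)  # placeholder for the real awaited API call
        return review.upper()


async def score_all(reviews: list[str]) -> list[str]:
    sem = asyncio.Semaphore(CONCURRENCY_LIMIT)
    tasks = [process_review(r, sem) for r in reviews]
    return await asyncio.gather(*tasks)  # results preserve input order


results = asyncio.run(score_all(["great film", "dull plot"]))
```

The semaphore keeps thousands of queued coroutines from overwhelming the API while still overlapping network latency, which is where the speedup over sequential execution comes from.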
This project follows a modular architecture designed for reproducibility, scalability, and separation of concerns. The directory structure is organized as follows:
```
├── config/                          # Global configuration files
│   └── settings.yaml                # Centralized parameters for timeouts, headers, and paths
│
├── data/                            # Data storage (Git-ignored)
│   ├── raw/                         # Immutable original corpus (Metadata, Reviews, URLs)
│   └── processed/                   # Canonical datasets enriched with sentiment scores
│
├── notebooks/                       # Jupyter notebooks for interactive analysis
│   └── 01_sentiment_pipeline.ipynb  # Main pipeline for EDA and visualization
│
├── src/                             # Source code (Python package)
│   ├── acquisition/                 # Data acquisition modules (spiders & scrapers)
│   │   ├── 01_fetch_urls.py         # Retrieves movie URLs from IMDb
│   │   ├── 02_extract_metadata.py   # Extracts high-dimensional metadata (Box Office, Credits)
│   │   └── 03_collect_reviews.py    # Collects user reviews via pagination
│   │
│   ├── utils/                       # Shared utility libraries
│   │   ├── config_loader.py         # Singleton loader for YAML configurations
│   │   ├── logger.py                # Centralized logging configuration
│   │   └── text_cleaner.py          # Regex-based text sanitization & normalization
│   │
│   └── __init__.py                  # Package initialization
│
├── .gitignore                       # Version control exclusions
├── LICENSE                          # MIT License
├── README.md                        # Project documentation
└── requirements.txt                 # Python dependencies for environment replication
```
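A loader in the spirit of `src/utils/config_loader.py` might look like this minimal sketch. The function name and caching choice are assumptions, and PyYAML is assumed to be among the dependencies:

```python
# Hypothetical singleton-style YAML config loader (not the repository's
# actual implementation). lru_cache ensures each file is parsed only once.
import functools
from pathlib import Path

import yaml  # PyYAML, assumed to be listed in requirements.txt


@functools.lru_cache(maxsize=None)
def load_settings(path: str = "config/settings.yaml") -> dict:
    """Load and cache the YAML configuration file."""
    with Path(path).open(encoding="utf-8") as fh:
        return yaml.safe_load(fh)
```

Every module then calls `load_settings()` instead of re-reading the file, so scraping parameters stay in one place and are parsed exactly once per process.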
- Python 3.9+
- OpenAI API Key (Required for the sentiment quantification pipeline)
- Clone the repository:

  ```bash
  git clone https://github.com/WLXie-Tony/Movie-Review-Sentiment-Quantification.git
  cd Movie-Review-Sentiment-Quantification
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Environment Configuration:

  Create a `.env` file in the root directory to store your credentials securely. Do not hardcode keys in scripts.

  ```env
  OPENAI_API_KEY=sk-proj-your_api_key_here
  ```
Step 1: Data Collection (Scraping)

To initiate the spider for retrieving movie metadata and raw reviews:

```bash
python src/acquisition/03_collect_reviews.py
```
Step 2: Sentiment Quantification (LLM Pipeline)

To run the asynchronous GPT-4o analysis pipeline on the raw data:

```bash
# This notebook demonstrates the core async ETL logic
jupyter notebook notebooks/01_sentiment_pipeline.ipynb
```
To rigorously quantify qualitative information, I modeled the sentiment extraction process as a probabilistic mapping function:

$$S = f_{\theta}(R, M; \tau)$$

Where:

- $R$: Unstructured review text.
- $M$: Vector of movie metadata (Budget, Box Office, Director).
- $S$: Structured output (sentiment scalar $\in [1, 10]$).
- $\tau$: Temperature parameter (set to $0$ for deterministic reproducibility).
If you use this code or data in your research, please cite the associated paper:
```bibtex
@article{TianXieZhang2026,
  title={Reading Between the Reels: An AI-Driven Approach to Analysing Movie Review Sentiment and Market Returns},
  author={Tian, Haowen and Xie, Wenlan (Tony) and Zhang, Yanlei},
  journal={International Journal of Finance \& Economics},
  year={2026},
  publisher={Wiley},
  doi={10.1002/ijfe.70129}
}
```
**Wenlan (Tony) Xie**
The University of Chicago
Email: wenlanx@uchicago.edu
Website: www.wenlanxie.com