MBTA Rider Segmentation

Team

Harvard 2018 Spring AC297r Capstone Project: Chia Chi (Michelle) Ho, Yijun Shen, Jiawen Tong, Anthony Hou

Motivation & Problem Statement

The Massachusetts Bay Transportation Authority (MBTA) is the largest public transportation agency in New England, delivering a complex system of subway, bus, commuter rail, light rail, and ferry services to riders in the dynamic economy of the Greater Boston Area. The MBTA is estimated to provide over 1.3 million trips on an average weekday. While the MBTA collects a wealth of trip transaction data every day, a persistent limitation has been the organization’s limited knowledge of its rider groups and their respective ridership habits. Understanding rider segmentation in terms of pattern-of-use has significant implications for developing new policies to improve the MBTA's service planning and for potentially changing its fare structure. Therefore, we aim to develop a flexible, reusable rider segmentation model for the MBTA’s “core system” (encompassing local buses and subway) that can group individuals according to pattern-of-use dimensions.

Project Deliverables

Our project deliverables are:

Figure 1: Project Deliverables

The specific goals of each project deliverable are:

Note: In addition to the code base, we are delivering pre-run monthly clustering results for Dec 2016 to Nov 2017 with equal weighting on temporal, geographical and ticket purchasing patterns. This set of cached results is the data available for display on the GitHub version of our dashboard.
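One way to picture "equal weighting" across the three pattern sets is to normalize each feature block and scale it by the square root of its weight, so that each block contributes its weight's share to squared Euclidean distances between riders. The sketch below is an assumption for illustration only: the function name, block names, and values are hypothetical and do not reflect the actual interface of our segmentation package.

```python
import math


def weight_feature_blocks(blocks, weights):
    """Concatenate feature blocks after L2-normalizing each one and
    scaling it by sqrt(weight / total weight).

    `blocks` maps a block name to a list of numbers for one rider;
    `weights` maps the same names to non-negative weights.
    Both are hypothetical structures used only for this sketch.
    """
    total = sum(weights[name] for name in blocks)
    combined = []
    for name, values in blocks.items():
        # Guard against an all-zero block (norm 0) by falling back to 1.0.
        norm = math.sqrt(sum(v * v for v in values)) or 1.0
        scale = math.sqrt(weights[name] / total)
        combined.extend(scale * v / norm for v in values)
    return combined


# Hypothetical feature blocks for a single rider.
rider_blocks = {
    "temporal": [3.0, 0.0, 1.0],
    "geographical": [2.0, 2.0],
    "ticket_purchasing": [1.0],
}
equal_weights = {"temporal": 1.0, "geographical": 1.0, "ticket_purchasing": 1.0}

combined = weight_feature_blocks(rider_blocks, equal_weights)
```

With equal weights, each normalized block carries one third of the combined vector's squared length, so no single pattern set dominates the distance computation simply because it has more columns.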

Our GitHub Organization:

Source code can be found in our GitHub organization (https://github.com/AC297r-MBTA-2018). It contains 4 repositories:

Note: The limited Dashboard, Final Report and Code Documentation are linked via a navigation bar on their respective GitHub pages.

Data Description

Available data sources:

The features we used and how we merged different data sources are summarized in Figure 2.

Figure 2: Data Structure

Literature Review

  1. Case Study in France: (Mahrsi et al. (2014). Understanding Passenger Patterns in Public Transit Through Smart Card and Socioeconomic Data. UrbComp.)

    In this paper, the authors presented an approach to mine passenger temporal behaviors in order to extract interpretable rider clusters. Briefly, each rider is represented as a vector of 168 features, where each feature is the number of trips the passenger took in a certain hour of a certain day of the week (24 hours/day x 7 days/week = 168 hours/week). Using a mixture of unigrams model, they obtained a set of 16 temporal clusters, each describing a temporal mobility pattern. Such patterns include typical commuter patterns, different morning peak times and different travel behaviors on weekends. To infer cluster socioeconomic characteristics, the authors first used a Hidden Markov Random Field model and census data to cluster residential neighborhoods based on socioeconomic characteristics. The riders were then assigned a socioeconomic class based on their inferred residential location. The authors found that the temporal clusters differed in their inferred socioeconomic distributions.

  2. Case Study in London: (Langlois et al. (2015). Inferring patterns in the multi-week activity sequences of public transport users. Transportation Research Part C.)

    In this study, the authors investigated passenger heterogeneity based on a longitudinal representation of each user’s multi-week activity sequence derived from smart card data. They then identified clusters of users with similar activity sequence structure. The application revealed 11 clusters, each characterized by a distinct sequence structure. Combined with demographic attributes (passenger age, occupation, household composition, income, and vehicle ownership) from a small sample of users, the analysis revealed significant connections between user demographic attributes and the activity patterns identified exclusively from fare transactions.
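The 168-feature hour-of-week representation described in the first case study can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the cited authors' code or our package's implementation: the function name `hour_of_week_vector`, the index convention (Monday = 0), and the sample tap timestamps are all hypothetical.

```python
from datetime import datetime

HOURS_PER_WEEK = 24 * 7  # 168 hour-of-week slots


def hour_of_week_vector(timestamps):
    """Count trips per hour-of-week slot.

    Slot index = weekday * 24 + hour, with Monday as weekday 0
    (the convention used by datetime.weekday()).
    """
    counts = [0] * HOURS_PER_WEEK
    for ts in timestamps:
        counts[ts.weekday() * 24 + ts.hour] += 1
    return counts


# Hypothetical fare-card taps for one rider.
taps = [
    datetime(2017, 10, 2, 8, 15),   # Monday, 08:15
    datetime(2017, 10, 2, 17, 40),  # Monday, 17:40
    datetime(2017, 10, 9, 8, 5),    # the following Monday, 08:05
    datetime(2017, 10, 7, 13, 0),   # Saturday, 13:00
]

vector = hour_of_week_vector(taps)
```

Trips in the same hour of the same weekday accumulate in one slot regardless of the calendar week, which is what lets a clustering algorithm pick up recurring patterns such as a weekday morning commute.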

Modeling Approach Overview

Our overall modeling approach is summarized in Figure 3.

Figure 3: Modeling Approach Overview. The approach is presented in the context of the overall structure of our project deliverables. Elements belonging to the Python segmentation package and visualization exploration tool are colored in dark blue and light blue-green, respectively.

Feature Sets

Sample Results

Because there are too many combinations of month, pipeline, and algorithm to show here, we present only the results for Oct 2017. Please see our dashboard (https://ac297r-mbta-2018.github.io/Dashboard/) to explore more results.

Comparing Cluster Statistics

Figure 4: Simple Cluster Statistics Comparison. Number of riders and average number of trips of clusters found using: A. the hierarchical pipeline and the LDA algorithm; B. the non-hierarchical pipeline and the LDA algorithm; C. the hierarchical pipeline and the K-means algorithm; D. the non-hierarchical pipeline and the K-means algorithm.

Comparing Cluster Temporal Patterns

Figure 5 compares the cluster temporal patterns found using different methods. All methods found distinct temporal patterns across clusters. In general, the interpretations of these clusters are similar (e.g. weekend riders vs. commuters vs. random riders), but the hierarchical pipeline found more subtle differences between clusters than the non-hierarchical pipeline.

Figure 5: Cluster Temporal Patterns Comparison. Temporal patterns of clusters found using: A. the hierarchical pipeline and the LDA algorithm; B. the non-hierarchical pipeline and the LDA algorithm; C. the hierarchical pipeline and the K-means algorithm; D. the non-hierarchical pipeline and the K-means algorithm.

Comparing Cluster Geographical Patterns

Figure 6: Selected Cluster Geographical Patterns Comparison. Geographical patterns of clusters found using: A. the hierarchical pipeline and the LDA algorithm; B. the non-hierarchical pipeline and the LDA algorithm; C. the hierarchical pipeline and the K-means algorithm; D. the non-hierarchical pipeline and the K-means algorithm.

Comparing Cluster Ticket Purchasing Patterns

Figure 7: Cluster Ticket Purchasing Patterns Comparison. Ticket purchasing patterns of clusters found using: A. the hierarchical pipeline and the LDA algorithm; B. the non-hierarchical pipeline and the LDA algorithm; C. the hierarchical pipeline and the K-means algorithm; D. the non-hierarchical pipeline and the K-means algorithm.

Comparing Inferred Cluster Demographics

Figure 8: Inferred Cluster Demographics Comparison. Inferred demographic distributions of clusters found using: A. the hierarchical pipeline and the LDA algorithm; B. the non-hierarchical pipeline and the LDA algorithm; C. the hierarchical pipeline and the K-means algorithm; D. the non-hierarchical pipeline and the K-means algorithm.

Sample Report

Sample of automatically generated reports for clusters found using the hierarchical pipeline and the LDA algorithm:

Sample of automatically generated reports for clusters found using the non-hierarchical pipeline and the K-means algorithm:

Conclusions

What We Deliver

Key Findings

Note: Due to the limited memory on our laptops, we were only able to analyze ridership one month at a time. With more powerful machines, rider segmentation over a longer time period (e.g. seasonal or yearly) could be possible, thus allowing ridership stability analyses.

Future Work