Rider Segmentation

Segmentation Approach Overview

Our overall segmentation approach is summarized in Figure 1.

Figure 1: Segmentation Pipeline

Based on client feedback, we first filtered out riders holding a commuter rail pass (other than Zone 1a) and riders who take fewer than 5 trips per month. The rationale is that the current fare transaction data collection system does not capture enough information to support inference about these riders.
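The filtering rule above can be sketched as a simple predicate. This is only an illustration: the record layout and the field names `commuter_rail_zone` and `trips_per_month` are hypothetical and are not taken from the package.

```python
def keep_rider(rider, min_trips=5):
    """Return True if a rider passes the pre-segmentation filter.

    `rider` is a dict with two hypothetical fields (assumptions, not the
    package's schema):
      - "commuter_rail_zone": None if the rider has no commuter rail pass,
        otherwise the zone of the pass as a string such as "1a" or "3"
      - "trips_per_month": number of trips taken per month
    """
    zone = rider["commuter_rail_zone"]
    # Commuter rail pass holders are dropped unless the pass is Zone 1a.
    excluded_pass = zone is not None and zone != "1a"
    # Riders with fewer than `min_trips` trips per month are dropped.
    too_few_trips = rider["trips_per_month"] < min_trips
    return not excluded_pass and not too_few_trips


riders = [
    {"id": "A", "commuter_rail_zone": None, "trips_per_month": 40},
    {"id": "B", "commuter_rail_zone": "1a", "trips_per_month": 22},
    {"id": "C", "commuter_rail_zone": "3", "trips_per_month": 30},
    {"id": "D", "commuter_rail_zone": None, "trips_per_month": 4},
]
kept_ids = [r["id"] for r in riders if keep_rider(r)]
```

In this toy input, riders A and B are kept, C is dropped for holding a non-Zone-1a commuter rail pass, and D is dropped for taking fewer than 5 trips.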

Recall that our feature sets (details) are:

Our modeling structure offers users several options, which we discuss below.

User-Specified Options

Pipeline Options

We implemented two clustering pipelines:

To choose the number of clusters at each clustering step (initial and final), we used the Calinski-Harabasz score, which is defined as the ratio of the between-cluster dispersion to the within-cluster dispersion; a higher score indicates denser, better-separated clusters. For more information on this scoring criterion, please consult the scikit-learn documentation on the Calinski-Harabasz score.
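As a rough illustration of the criterion, the score for n points in k clusters is commonly written as CH = [B / (k - 1)] / [W / (n - k)], where B is the between-cluster dispersion and W is the within-cluster dispersion. The pure-Python sketch below computes it from scratch on toy data; it is not the package's code, and in practice scikit-learn's `calinski_harabasz_score` would normally be used instead.

```python
from math import fsum


def calinski_harabasz(points, labels):
    """Calinski-Harabasz score: [B / (k - 1)] / [W / (n - k)].

    B sums, over clusters, the cluster size times the squared distance from
    the cluster centroid to the overall centroid; W sums the squared
    distances from each point to its own cluster centroid.
    """
    n, dim = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the score needs 2 <= k < n")

    overall = [fsum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for members in clusters.values():
        size = len(members)
        centroid = [fsum(p[d] for p in members) / size for d in range(dim)]
        between += size * fsum((c - o) ** 2 for c, o in zip(centroid, overall))
        within += fsum((x - c) ** 2 for p in members for x, c in zip(p, centroid))

    if within == 0:
        # Perfectly tight clusters; this sketch returns infinity
        # (library implementations may handle this case differently).
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two obvious groups along the first coordinate.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_score = calinski_harabasz(toy_points, [0, 0, 1, 1])  # groups respected
bad_score = calinski_harabasz(toy_points, [0, 1, 0, 1])   # groups split
```

The labeling that respects the two groups scores far higher than the one that splits them, which is why the candidate number of clusters with the highest score is preferred.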

Algorithm Options

We implemented two unsupervised clustering algorithms for the user to choose between and compare:

Options for Feature Weights

Since policy makers might wish to develop policies targeting a particular aspect of rider behavior (e.g. time of use vs. location of use), our package allows flexibility in weighting the different feature sets. Recall that we roughly divided our feature sets into 3 broad categories: temporal, geographical and ticket purchasing patterns. The segmentation module in the package takes a single user-specified value for the weight on the temporal patterns (w, from 0 to 100), and the two pipelines handle this input differently.

With the non-hierarchical pipeline:

Step             | Feature Set                                                                        | Weight
Final Clustering | 168 Hourly Temporal Patterns + Most Frequent Trip Hours + Time Flexibility Scores | w
                 | Geographical Patterns by Zip Codes                                                 | 0.5(100 - w)
                 | Ticket Purchasing Patterns                                                         | 0.5(100 - w)

With the hierarchical pipeline:

Step               | Feature Set                                                                        | Weight
Initial Clustering | Weekend-vs-Weekday                                                                 | 100
                   | Ticket Purchasing Pattern                                                          | 100
Final Clustering   | 168 Hourly Temporal Patterns + Most Frequent Trip Hours + Time Flexibility Scores | w
                   | Geographical Patterns by Zip Codes                                                 | 100 - w
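The two weighting schemes above can be summarized in a short sketch that maps the single user input w to per-feature-set weights. The function name, the pipeline labels and the dictionary keys are hypothetical and chosen only for illustration; how the package applies these weights to the features is not shown here.

```python
def feature_set_weights(w, pipeline):
    """Map the temporal weight w (0 to 100) to weights per clustering step.

    Hypothetical helper, not the package's API. `pipeline` is either
    "non-hierarchical" or "hierarchical".
    """
    if not 0 <= w <= 100:
        raise ValueError("w must be between 0 and 100")

    if pipeline == "non-hierarchical":
        # The remainder is split evenly between the two non-temporal sets.
        rest = 0.5 * (100 - w)
        return {
            "final": {
                "temporal": w,
                "geographical": rest,
                "ticket_purchasing": rest,
            }
        }

    if pipeline == "hierarchical":
        # The initial step uses fixed weights; w only affects the final step.
        return {
            "initial": {"weekend_vs_weekday": 100, "ticket_purchasing": 100},
            "final": {"temporal": w, "geographical": 100 - w},
        }

    raise ValueError("unknown pipeline: " + repr(pipeline))


flat = feature_set_weights(60, "non-hierarchical")
nested = feature_set_weights(60, "hierarchical")
```

With w = 60, the non-hierarchical pipeline assigns 60 to the temporal features and 20 each to the geographical and ticket purchasing features, while the hierarchical pipeline assigns 60 and 40 to the temporal and geographical features in its final step.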

Summary Findings