The feature.py module cleans MBTA transaction-level data and extracts rider-level pattern-of-use features.
DataLoader
A DataLoader object first merges the joint AFC_ODX table, the stops table and fare product tables to form transaction records. The preprocessed transaction records are then passed to a FeatureExtractor object to extract the rider-level pattern-of-use features.
Note: A DataLoader object is initialized by a FeatureExtractor object and is not explicitly used elsewhere in our project.
start_month: a string representing the start month in the format of YYMM, e.g. '1710'
duration: an integer representing the length of duration (in months)
afc_odx_fields: a list of fields used to read the joint AFC_ODX table, ['deviceclassid', 'trxtime', 'tickettypeid', 'card', 'origin', 'movementtype']
fp_field: a list of fields used to read the fare product table, ['tariff', 'servicebrand', 'usertype', 'tickettypeid', 'zonecr']
stops_fields: a list of fields used to read the stops table, ['stop_id', 'zipcode']
fareprod: a DataFrame of fare product records
stops: a DataFrame of stop records
station_deviceclassid: a list of device class IDs of interest, [411, 412, 441, 442, 443, 501, 503]
validation_movementtype: a list of validation movement types of interest, [7, 20]
df: a DataFrame of preprocessed transaction records
__init__(self, start_month, duration):
load(self):
  [...] start_month and duration as a DataFrame
  [...] station_deviceclassid and validation_movementtype of interest
  [...] self.df
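The merge-and-filter step described for DataLoader can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's DataFrame code: the helper name preprocess, the sample rows, the join of origin to stop_id, the inner-join behaviour, and the requirement that a record pass both filters are all assumptions. Only the field names and the two ID lists come from the attribute list.

```python
# ID lists taken from the attribute list; everything else is illustrative.
STATION_DEVICECLASSID = [411, 412, 441, 442, 443, 501, 503]
VALIDATION_MOVEMENTTYPE = [7, 20]


def preprocess(afc_odx, fareprod, stops):
    """Join transactions to fare products and stops, then filter them."""
    # Lookup tables keyed by the assumed join columns.
    fare_by_ticket = {fp["tickettypeid"]: fp for fp in fareprod}
    zip_by_stop = {s["stop_id"]: s["zipcode"] for s in stops}

    records = []
    for trx in afc_odx:
        # Assumption: a record must pass both filters to be kept.
        if trx["deviceclassid"] not in STATION_DEVICECLASSID:
            continue
        if trx["movementtype"] not in VALIDATION_MOVEMENTTYPE:
            continue
        fare = fare_by_ticket.get(trx["tickettypeid"])
        zipcode = zip_by_stop.get(trx["origin"])
        if fare is None or zipcode is None:
            continue  # assumption: unmatched rows are dropped (inner join)
        records.append({**trx, **fare, "zipcode": zipcode})
    return records


# Made-up sample rows using the documented field names.
afc_odx = [
    {"deviceclassid": 411, "trxtime": "2017-10-02 08:15:00", "tickettypeid": 1,
     "card": "A", "origin": "S1", "movementtype": 7},
    {"deviceclassid": 999, "trxtime": "2017-10-02 09:00:00", "tickettypeid": 1,
     "card": "B", "origin": "S1", "movementtype": 7},  # device class not of interest
    {"deviceclassid": 412, "trxtime": "2017-10-02 09:30:00", "tickettypeid": 1,
     "card": "C", "origin": "S1", "movementtype": 3},  # movement type not of interest
]
fareprod = [{"tariff": "Monthly Pass", "servicebrand": "Rapid Transit",
             "usertype": "Adult", "tickettypeid": 1, "zonecr": ""}]
stops = [{"stop_id": "S1", "zipcode": "02139"}]

records = preprocess(afc_odx, fareprod, stops)
```

Only the first sample transaction survives both filters, and it carries the fare product and zipcode columns alongside the original transaction fields.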
FeatureExtractor
A FeatureExtractor object first extracts the rider-level temporal, geographical, and ticket-purchasing features based on the preprocessed transaction records returned by the DataLoader. It then labels riders by their total number of trips and by whether they use commuter rail, except for zone 1a.
The second step supports further filtering in the segmentation model.
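As a rough illustration of what a rider-level ticket-purchasing feature could look like, the sketch below counts trips per category value for each rider and names each resulting column with a '{feature}_' prefix, as the entry for _get_one_purchase_feature later in this section suggests. The function name count_trips_by_feature, the riderID key, and the sample records are hypothetical, and plain Python stands in for the project's DataFrame operations.

```python
from collections import Counter, defaultdict


def count_trips_by_feature(transactions, feature):
    """Count trips per value of `feature` for every rider.

    Returns {riderID: {"<feature>_<value>": n_trips}}.
    """
    counts = defaultdict(Counter)
    for trx in transactions:
        column = f"{feature}_{trx[feature]}"
        counts[trx["riderID"]][column] += 1
    return {rider: dict(c) for rider, c in counts.items()}


# Made-up transactions; the category values are placeholders.
transactions = [
    {"riderID": "A", "tariff": "Monthly Pass", "usertype": "Adult"},
    {"riderID": "A", "tariff": "Monthly Pass", "usertype": "Adult"},
    {"riderID": "A", "tariff": "Single Ride", "usertype": "Adult"},
    {"riderID": "B", "tariff": "Single Ride", "usertype": "Student"},
]
tariff_counts = count_trips_by_feature(transactions, "tariff")
```

Calling the same helper once per entry in purchase_features and joining the results on the rider ID would give one block of prefixed columns per ticket-purchasing dimension.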
start_month: a string representing the start month in the format of YYMM, e.g. '1710'
duration: an integer representing the length of duration
df_transaction: a DataFrame of preprocessed transaction records returned by a DataLoader object
purchase_features: a list of ticket-purchasing types, ['tariff', 'usertype', 'servicebrand', 'zonecr']
__init__(self, start_month='1701', duration=1):
  [...] DataLoader object
_extract_temporal_patterns(self): For each riderID, [...]
_extract_geographical_patterns(self):
_get_one_purchase_feature(self, feature):
  feature: one item from the purchase_features list
  [...] feature, each with a column name prefix '{feature}_'
_extract_ticket_purchasing_patterns(self):
  [...] _get_one_purchase_feature(self, feature) to extract the number of trips associated with all ticket purchasing dimensions specified in purchase_features
_label_rider_by_trip_frequency(self, rider):
  rider: a row in the rider features DataFrame
_label_commuter_rail_rider(self, rider):
  rider: a row in the rider features DataFrame
extract_features(self):
  [...] _extract_temporal_patterns(), _extract_geographical_patterns(), _extract_ticket_purchasing_patterns() to extract each group of features
  [...] .csv file to the features cache path
Segmentation
A Segmentation object clusters riders by their temporal, geographical, and ticket-purchasing features based on a user-specified pipeline option (hierarchical vs. non-hierarchical), a user-specified algorithm option (kmeans vs. lda), and user-specified feature weights.
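The method list for this class mentions scoring each candidate clustering with a Calinski-Harabasz index. For reference, a minimal pure-Python version of that score is sketched here. The project may well rely on scikit-learn for this instead, and the function name, the handling of zero within-cluster dispersion, and the toy data are assumptions made for illustration. Higher scores indicate tighter, better-separated clusters.

```python
def calinski_harabasz(points, labels):
    """Between-cluster dispersion over within-cluster dispersion,
    each divided by its degrees of freedom."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of points")

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = within = 0.0
    for rows in clusters.values():
        centroid = mean(rows)
        between += len(rows) * sq_dist(centroid, overall)
        within += sum(sq_dist(r, centroid) for r in rows)
    if within == 0:
        return float("inf")  # assumption: perfectly tight clusters score as infinity
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two obvious groups, scored under a good and a bad labelling.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
candidates = {"good": [0, 0, 1, 1], "bad": [0, 1, 0, 1]}
scores = {name: calinski_harabasz(points, labels)
          for name, labels in candidates.items()}
best = max(scores, key=scores.get)
```

Picking the candidate with the highest score mirrors the idea of trying every value in n_clusters_list and keeping the best-scoring result.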
random_state, max_iter, tol: attributes for initializing K-means
start_month: a string representing the start month in the format of YYMM, e.g. '1710'
duration: an integer representing the length of duration
w_time_choice: an integer from 0 to 100 representing the weight of temporal features as a percentage
N_riders: the number of riders in the rider features DataFrame
time_feats: a list of column names of all sets of temporal features
geo_feats: a list of column names of all geographical features
purchase_feats: a list of column names of all ticket-purchasing features
self.weekday_vs_weekend_feats = ['weekday', 'weekend']
features: a list of concatenated time, geo and ticket-purchasing features for non-hierarchical clustering
features_layer_1: a list of features used in the initial clustering step
features_layer_2: a list of features used in the final clustering step
w_time: weight for temporal patterns
w_geo: weight for geographical patterns
w_purchase: weight for purchase patterns
w_week: weight for weekend/weekday columns
df: the rider features DataFrame
__init__(self, w_time=None, start_month='1701', duration=1, random_state=RANDOM_STATE, max_iter=MAX_ITER, tol=TOL):
  [...] __get_data(), __standardize_features(), __normalize_features()
__get_data(self):
  [...] self.df the extracted rider features DataFrame of the specified start_month and duration
__standardize_features(self):
__normalize_features(self):
__apply_clustering_algorithm(self, features, model, n_clusters_list=[2, 3, 4, 5]):
  features: the rider features DataFrame
  model: an initialized but not fitted sklearn K-means or LDA object
  n_clusters_list: a list of integers representing the range of possible numbers of clusters
  [...] features using the segmentation model for all values in n_clusters_list
  [...] __get_cluster_score(self, features, cluster_labels) to get a Calinski-Harabasz Index score
__get_cluster_score(self, features, cluster_labels):
  features: the rider features DataFrame
  cluster_labels: a list of integers representing cluster assignments
__initial_rider_segmentation(self, hierarchical=False):
  hierarchical: a boolean indicator of whether the segmentation pipeline is hierarchical or not
__final_rider_segmentation(self, model, features, n_clusters_list=[2, 3, 4, 5], hierarchical=False):
  hierarchical: a boolean indicator of whether the segmentation pipeline is hierarchical or not
get_rider_segmentation(self, hierarchical=False):
  hierarchical: a boolean indicator of whether the segmentation pipeline is hierarchical or not
  [...] .csv files to the cluster cache path
CensusFormatter
A CensusFormatter object formats the census data as counts, percentages, or proportions based on the user’s specification.
new_col_names: static class variable, a list of column names used to rename raw census columns
census_groups: static class variable, a dictionary of census groups and prefixes in the renamed census DataFrame
raw_census_filepath: a string representing the file path of the raw census data
census_in_counts: a DataFrame of census data represented in counts
census_in_percents: a DataFrame of census data represented in percentages
census_in_proportions: a DataFrame of census data represented in proportions
__init__(self, raw_census_filepath):
__format_raw_census_in_counts(self, raw_census_filepath):
  raw_census_filepath: a string of the file path to the raw census data
__convert_to_percents(self, census_in_counts):
  census_in_counts: a DataFrame of census data represented in counts (the output of __format_raw_census_in_counts())
__convert_to_proportions(self, census_in_counts):
  census_in_counts: a DataFrame of census data represented in counts (the output of __format_raw_census_in_counts())
to_csv(self, filename, census_type='proportions'):
  filename: a string of the file name to save
  census_type: a string specifying which census type to save, options = ['percents', 'counts', 'proportions'], default = 'proportions'
  [...] census_type
get_census_in_counts(self):
get_census_in_percents(self):
  [...] census_in_percents
get_census_in_proportions(self):
  [...] census_in_proportions
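The relationship between the three census representations can be illustrated with a small sketch. It assumes that each count is normalized by the total of its census group, with a group identified by a shared column prefix. The actual denominators used by CensusFormatter are not stated in this document, and the function names, column names, and numbers below are invented for the example.

```python
def to_proportions(counts, group_prefixes):
    """Divide each count by the total of the group sharing its prefix."""
    proportions = {}
    for prefix in group_prefixes:
        cols = [c for c in counts if c.startswith(prefix)]
        total = sum(counts[c] for c in cols)
        for c in cols:
            # An empty group yields zeros rather than a division error.
            proportions[c] = counts[c] / total if total else 0.0
    return proportions


def to_percents(counts, group_prefixes):
    """Same normalization as to_proportions, scaled to 0-100."""
    props = to_proportions(counts, group_prefixes)
    return {c: 100.0 * p for c, p in props.items()}


# Hypothetical counts for one geographical unit.
unit_counts = {"race_a": 30, "race_b": 70, "edu_x": 25, "edu_y": 75}
props = to_proportions(unit_counts, ["race_", "edu_"])
pcts = to_percents(unit_counts, ["race_", "edu_"])
```

Under this assumption the proportions within each group sum to 1 and the percentages to 100.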
ClusterProfiler
A ClusterProfiler object summarizes each cluster’s overall pattern-of-use features and infers its demographics distributions based on the mapping from its softmax-transformed geographical patterns to the census data.
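The softmax transform mentioned here turns a row of numeric values into probabilities that sum to one, which is what allows a cluster's geographical pattern to be used as weights over census records. A minimal sketch follows, assuming the transform is applied row by row (one row per cluster, one column per geographical unit); the function name and inputs are illustrative and not the project's _softmax implementation.

```python
import math


def softmax_rows(rows):
    """Apply softmax to each row; subtracting the row max avoids overflow."""
    result = []
    for row in rows:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# Two hypothetical clusters over three geographical units.
probs = softmax_rows([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
```

Each output row sums to one, larger inputs receive larger probabilities, and a row of equal inputs maps to a uniform distribution.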
feat_groups: static class variable, a list of feature groups and expected prefixes in the rider features DataFrame
demo_groups: static class variable, a dictionary of demographics groups and prefixes in the profiled cluster DataFrame
start_month: a string representing the start month in the format of YYMM, e.g. '1710'
duration: an integer representing the length of duration
hierarchical: a boolean indicator of whether the segmentation pipeline is hierarchical or not
census: a DataFrame of census data represented in counts returned by a CensusFormatter object
w_time: an integer from 0 to 100 representing the weight of temporal features as a percentage
input_path: a string for the path to the cached cluster results
param_keys: a list of parameter keys
riders: the rider features DataFrame with cluster assignments
__init__(self, hierarchical=False, w_time=None, start_month='1701', duration=1):
  [...] __get_data() to get the rider features DataFrame with cluster assignments
__split(self, delimiters, string, maxsplit=0):
  [...] delimiters to match cached results
__get_cached_params_list(self):
__get_data(self):
  [...] self.riders attribute
_softmax(self, df):
  df: a DataFrame of numeric values
  [...] df into softmax probabilities
_summarize_features(self, riders, by_cluster):
  riders: a DataFrame containing rider-level pattern-of-use features used to form clusters plus resulting cluster assignment
  by_cluster: a boolean indicating whether to summarize features by cluster or overall
_summarize_demographics(self, cluster_features):
  cluster_features: a DataFrame containing cluster-level pattern-of-use features
_get_first_2_pca_components(self, features):
  features: a DataFrame containing cluster-level pattern-of-use features
extract_profile(self, algorithm, by_cluster):
  algorithm: a string, options = ['kmeans', 'lda']
  by_cluster: a boolean indicating whether to summarize features by cluster or overall
  [...] __save_profile(self, profile, algorithm, by_cluster)
__save_profile(self, profile, algorithm, by_cluster):
  profile: a DataFrame of summarized cluster pattern-of-use features with its demographics distribution
  algorithm: a string, options = ['kmeans', 'lda']
  by_cluster: a boolean indicating whether to summarize features by cluster or overall
Visualization
A Visualization object renders the cluster profiles in several forms: a static heatmap for cluster temporal patterns, a static scatter chart of the clusters on a 2D PCA subspace, an interactive map for cluster geographical patterns, and static bar charts for other cluster statistics.
start_month: a string representing the start month in the format of YYMM, e.g. '1710'
duration: an integer representing the length of duration
input_path: a string for the input directory path (path to cached profiles)
output_path: a string for the output directory path
param_keys: a list of parameter keys for matching user-specified options to the cached cluster profiles (this is primarily for reading in the data)
df: the DataFrame with cluster profiles data for visualization
req_view: user-specified view request, options are ["overview", "hierarchical", "non-hierarchical"]. "Overview" is the option to view the overall pattern where all riders are treated as one big cluster. "Hierarchical" and "non-hierarchical" are options to view the clustering results from the hierarchical or the non-hierarchical pipeline.
req_w_time: user-specified weight on temporal patterns, possible values are integers from 0 to 100. Note: equal weighting between temporal, geographical, and ticket purchasing patterns is assumed if this weight is set to 0 or is left unspecified.
req_algo: user-specified algorithm request, options are ["lda", "kmeans"] for viewing the clustering results from the LDA or K-means algorithm, respectively.
__init__(self, start_month='1701', duration=1):
  [...] start_month, duration, input_path, output_path and param_keys
  [...] output_path if it does not already exist
__split(self, delimiters, string, maxsplit=0):
  [...] delimiters to match cached results
__get_cached_params_list(self):
__read_csv(self, req_param_dict, by_cluster):
  [...] self.df attribute
load_data(self, by_cluster=False, hierarchical=False, w_time=None, algorithm=None):
  [...] __read_csv() function for reading in the data if the requested file exists in the cached_profile directory or make a ClusterProfiler to extract the requested cluster profile
visualize_clusters_2d(self):
plot_cluster_hourly_pattern(self, cluster):
plot_all_hourly_patterns(self):
  [...] self.df attribute
plot_cluster_geo_pattern(self, cluster):
  [...] cached_viz unless the user resets the path in config.py
__single_feature_viz(self, feature, title, ylabel, xlabel):
  [...] plot_cluster_size and plot_avg_num_trips functions below to plot a single bar chart of cluster feature vs. cluster ID
__group_feature_viz(self, grp_key, stacked, title, ylabel, xlabel):
  [...] plot_demographics and plot_ticket_purchasing_patterns functions below to plot multi-bar visualizations of cluster feature vs. cluster ID
plot_cluster_size(self):
plot_avg_num_trips(self):
plot_demographics(self, grp, stacked=True):
  The grp option specifies which type of demographics distribution to display. Options are ['race', 'emp', 'edu', 'inc'] for race, employment, education and income.
plot_ticket_purchasing_patterns(self, grp, stacked=True):
  The grp option specifies which type of ticket purchasing habit to display. Options are ['servicebrand', 'usertype', 'tariff'] for service brand (e.g. Rapid Transit), user type (e.g. Adult or Student), and tariff type (e.g. Monthly Pass).
ReportGenerator
A ReportGenerator object is initialized in a ClusterProfiler object. It generates a text summary for each cluster based on the output of the ClusterProfiler that contains it and a pre-trained (and retrainable) Convolutional Neural Network (CNN) model for the 7x24 temporal pattern classification.
n_classes: an integer indicating the number of different types of riders to classify
cnn_model_filename: a string of the file name (.h5) of the pre-trained CNN model
sample_factor: an integer indicating the factor for oversampling clusters’ 7 x 24 time matrices
noise_std: a float indicating the standard deviation for the Gaussian noise signal used in oversampling the time matrices
get_text(self, row):
  row: a row of the profiled cluster DataFrame
generate_report(self, df):
  df: a DataFrame of the profiled cluster
  [...] df using the pre-trained CNN model
  [...] df
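The sample_factor and noise_std attributes suggest that training data for the CNN is built by oversampling each cluster's 7 x 24 time matrix with added Gaussian noise. A minimal sketch of that idea follows. The function name, the seed argument, and the choice to add independent noise to every cell are assumptions, not a description of the project's code.

```python
import random


def oversample(matrix, sample_factor, noise_std, seed=None):
    """Return `sample_factor` noisy copies of a day-by-hour matrix."""
    rng = random.Random(seed)  # seeded generator for reproducible samples
    return [
        [[value + rng.gauss(0.0, noise_std) for value in row] for row in matrix]
        for _ in range(sample_factor)
    ]


# A hypothetical 7 x 24 pattern with a single peak (position is illustrative).
base = [[0.0] * 24 for _ in range(7)]
base[0][8] = 1.0
samples = oversample(base, sample_factor=5, noise_std=0.01, seed=0)
```

Each returned sample keeps the 7 x 24 shape of the input, so the copies can be stacked directly as inputs to a classifier.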