
High performance, large-scale regression

by Alessandra Cabassi and Junyang Wang

Logistic and linear regression with TensorFlow

We make use of Estimators, a high-level TensorFlow API that includes implementations of the most popular machine learning algorithms. Here, in order to perform logistic regression, we use the LinearClassifier estimator. You can learn more about Estimators on the official TensorFlow website: https://www.tensorflow.org/guide/estimators.

Training

Instantiating and training a LinearClassifier is very simple. Assuming that we have defined a list of numeric columns my_numeric_columns and a list of categorical columns my_categorical_columns, we can instantiate a LinearClassifier as follows:

import tensorflow as tf
classifier = tf.estimator.LinearClassifier(feature_columns=my_numeric_columns+my_categorical_columns)

For example, for the airline data, the columns can be defined as:

import tensorflow.feature_column as fc

year = fc.categorical_column_with_vocabulary_list('Year', ['1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008'])
month = fc.categorical_column_with_vocabulary_list('Month', ['1','2','3','4','5','6','7','8','9','10','11','12'])
dayofmonth = fc.categorical_column_with_vocabulary_list('DayofMonth', ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20', '21', '22', '23', '24', '25', '26', '27', '28', '29','30', '31'])
dayofweek = fc.categorical_column_with_vocabulary_list('DayOfWeek', ['1','2','3','4','5','6','7'])
deptime = fc.numeric_column('DepTime')
arrtime = fc.numeric_column('ArrTime')
uniquecarrier = fc.categorical_column_with_hash_bucket('UniqueCarrier', hash_bucket_size=1000)
flightnum = fc.categorical_column_with_hash_bucket('FlightNum', hash_bucket_size=10000)
arrdelay = fc.numeric_column('ArrDelay')
depdelay = fc.numeric_column('DepDelay')
origin = fc.categorical_column_with_hash_bucket('Origin', hash_bucket_size=1000)
dest = fc.categorical_column_with_hash_bucket('Dest', hash_bucket_size=1000)
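distance = fc.numeric_column('Distance')
# Assumption: Cancelled and Diverted are 0/1 flags that are read as strings; they are needed by the column lists below
cancelled = fc.categorical_column_with_vocabulary_list('Cancelled', ['0', '1'])
diverted = fc.categorical_column_with_vocabulary_list('Diverted', ['0', '1'])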

Note that we have used three types of columns: fc.numeric_column for continuous variables, fc.categorical_column_with_vocabulary_list for categorical variables whose classes are all known and can be easily enumerated, and fc.categorical_column_with_hash_bucket for categorical variables with a high number of classes (such as FlightNum). The parameter hash_bucket_size is the number of buckets into which the categories are hashed, so it is an upper bound on the number of categories that the model can tell apart; it should be chosen comfortably larger than the expected number of categories to limit collisions. More information about the different types of feature columns available for TensorFlow estimators can be found at https://www.tensorflow.org/guide/feature_columns.
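To make the role of hash_bucket_size more concrete, the following pure-Python sketch mimics what a hashed categorical column does: each string is hashed and the result is reduced modulo the number of buckets. This is an illustration only. TensorFlow uses its own hash function, so the real bucket indices will differ, and bucket_index and the sample carrier codes are hypothetical names chosen for this example.

```python
import hashlib

def bucket_index(value, hash_bucket_size):
    """Map a string category to a bucket id in [0, hash_bucket_size).

    Illustrative stand-in for a hashed categorical column; TensorFlow
    uses a different hash function internally.
    """
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

# Hypothetical sample of carrier codes.
carriers = ["AA", "DL", "UA", "WN", "US", "NW", "CO", "B6"]

# With many more buckets than categories, collisions are unlikely.
large = {c: bucket_index(c, 1000) for c in carriers}

# With fewer buckets than categories, some categories must share a bucket
# and would therefore share a single regression coefficient.
small = {c: bucket_index(c, 4) for c in carriers}

print(large)
print(small)
print("distinct buckets used with 4 buckets:", len(set(small.values())))
```

Categories that land in the same bucket cannot be distinguished by the model, which is why a generous bucket count is usually preferred for variables such as FlightNum.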

For clarity of exposition, we divide them into numeric and categorical columns:

my_numeric_columns = [deptime, arrtime, distance]  # depdelay is not used as a predictor
my_categorical_columns = [year, month, dayofmonth, dayofweek, uniquecarrier, flightnum, origin, dest, cancelled, diverted]

Once the Estimator has been instantiated, it can be easily trained with the train method:

classifier.train(train_inpf)

where train_inpf is the input function that feeds the data into the estimator.

Defining an input function

The input function train_inpf is defined in four steps.

1. Defining input format

First, we need to define the names of the columns in the dataset (CSV_COLUMNS), the corresponding default values (DEFAULTS), and the name of the response variable (LABEL_COLUMN).

CSV_COLUMNS = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'ArrTime', 'UniqueCarrier', 'FlightNum',  'ArrDelay', 'DepDelay', 'Origin', 'Dest', 'Distance', 'Cancelled', 'Diverted']
DEFAULTS = [[""], [""], [""], [""], [0], [0], [""], [""], [0.], [0.],[""], [""], [0], [""],[""]]
LABEL_COLUMN = 'ArrDelay'

Note that we have chosen [""] as a default for the categorical variables, [0] for the integer variables, and [0.] for the continuous variables. Defaults also define the type of the input column to be loaded from file, so it is important that they match the variable type.
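Because the defaults determine the column types, it can be worth checking that the two lists line up before building the input pipeline. The snippet below is a small sanity check written for this purpose, not part of the original code; column_types is a hypothetical helper name.

```python
CSV_COLUMNS = ['Year', 'Month', 'DayofMonth', 'DayOfWeek', 'DepTime', 'ArrTime',
               'UniqueCarrier', 'FlightNum', 'ArrDelay', 'DepDelay', 'Origin',
               'Dest', 'Distance', 'Cancelled', 'Diverted']
DEFAULTS = [[""], [""], [""], [""], [0], [0], [""], [""], [0.], [0.],
            [""], [""], [0], [""], [""]]

# There must be exactly one default per column, in the same order.
assert len(CSV_COLUMNS) == len(DEFAULTS)

# The Python type of each default shows how the column will be read:
# str -> categorical, int -> integer, float -> continuous.
column_types = {name: type(default[0]).__name__
                for name, default in zip(CSV_COLUMNS, DEFAULTS)}

for name, type_name in column_types.items():
    print('{}: {}'.format(name, type_name))
```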

2. Parsing csv files

To parse the CSV files, we first need to specify the pattern that the input file names must match. In this case, after downloading the data in CSV format into a dedicated folder, we can easily indicate to our parser that we want to train the model on the year 2006,

train_file = "2006.csv"

or that we want to use all the data in that folder during the training step.

train_file = "*.csv"
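The effect of the two patterns can be previewed without TensorFlow. The sketch below uses Python's fnmatch module on a made-up folder listing; the file names are hypothetical, and the exact matching rules of the TensorFlow file lister may differ in edge cases.

```python
import fnmatch

# Hypothetical contents of the data folder.
files_in_folder = ["1987.csv", "2006.csv", "2007.csv", "README.txt"]

# An exact name selects a single file.
single_year = fnmatch.filter(files_in_folder, "2006.csv")

# A wildcard selects every CSV file in the folder.
all_years = fnmatch.filter(files_in_folder, "*.csv")

print(single_year)
print(all_years)
```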

Now we can define the parser:

def parse_csv(value):
      # value is a single line of text read from one of the CSV files
      # select_cols gives the positions of the CSV_COLUMNS fields in the raw file; "NA" entries are replaced by the defaults
      columns = tf.decode_csv(value, record_defaults=DEFAULTS, select_cols=[0, 1, 2, 3, 4, 6, 8, 9, 14, 15, 16, 17, 18, 19, 21], na_value="NA")
      features = dict(zip(CSV_COLUMNS, columns))
      labels = features.pop(LABEL_COLUMN)
      # Define the two classes for logistic regression:
      # if ArrDelay is greater than 0, then the label is True (i.e. the flight was delayed);
      # otherwise, the label is False
      classes = tf.greater(labels, 0)
      return features, classes
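The logic of the parser can be followed step by step in plain Python. The sketch below is a simplified stand-in, not the TensorFlow implementation: it uses a shortened, hypothetical column layout (COLUMNS, SELECT_COLS and the sample lines are made up for the example) and shows how selected fields are cast according to their defaults, how "NA" entries fall back to the default, and how the label is binarised.

```python
import csv

# Shortened, hypothetical layout: three fields are kept out of five.
COLUMNS = ['Month', 'DepTime', 'ArrDelay']
DEFAULTS = [[""], [0], [0.]]
SELECT_COLS = [0, 2, 3]  # positions of the kept fields in the raw line
NA_VALUE = "NA"

def parse_line(line):
    raw = next(csv.reader([line]))
    values = []
    for index, default in zip(SELECT_COLS, DEFAULTS):
        field = raw[index]
        if field == "" or field == NA_VALUE:
            # Missing entry: fall back to the default value.
            values.append(default[0])
        else:
            # Cast the text to the type of the default (str, int or float).
            values.append(type(default[0])(field))
    features = dict(zip(COLUMNS, values))
    label = features.pop('ArrDelay')
    # True if the flight arrived late, False otherwise.
    return features, label > 0

print(parse_line("1,ignored,1530,12,ignored"))
print(parse_line("1,ignored,NA,NA,ignored"))
```

Note one consequence visible in the second call: a missing ArrDelay is replaced by the default 0 and is therefore labelled as not delayed. Whether that is acceptable depends on the analysis; filtering such rows beforehand is one alternative.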

3. Defining the input function

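In the input function, we need to list the files that match the given pattern, read their lines, parse each line, optionally shuffle the records, repeat the dataset for the required number of epochs, and group the records into batches: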

def input_fn(data_file, num_epochs, shuffle, batch_size, buffer_size=1000):
      # Create list of file names that match "glob" pattern (i.e. data_file_*.csv)
      filenames_dataset = tf.data.Dataset.list_files(data_file)
      # Read lines from text files
      textlines_dataset = filenames_dataset.flat_map(tf.data.TextLineDataset)
      # Parse text lines as comma-separated values (CSV)
      dataset = textlines_dataset.map(parse_csv)
      if shuffle:
          dataset = dataset.shuffle(buffer_size=buffer_size)
      # We call repeat after shuffling, rather than before, to prevent separate epochs from blending together.
      dataset = dataset.repeat(num_epochs)
      # Get a batch of data of size batch_size
      dataset = dataset.batch(batch_size)
      return dataset
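The order of the shuffle, repeat and batch steps can be illustrated with ordinary Python generators. The following is a simplified model of the pipeline, under the assumption that shuffling works through a fixed-size buffer from which elements are drawn at random; the function names shuffled and pipeline are hypothetical, and the real tf.data implementation differs in its details.

```python
import random

def shuffled(records, buffer_size, rng):
    """Yield records in an order produced by a fixed-size shuffle buffer."""
    buffer = []
    for record in records:
        buffer.append(record)
        if len(buffer) > buffer_size:
            # Emit a random element, keeping buffer_size elements in memory.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Input exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

def pipeline(records, num_epochs, batch_size, buffer_size, seed=0):
    rng = random.Random(seed)
    batch = []
    # Repeating after shuffling: each epoch is shuffled on its own,
    # so every record is seen once before any record is seen twice.
    for _ in range(num_epochs):
        for record in shuffled(records, buffer_size, rng):
            batch.append(record)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        # Final batch, possibly smaller than batch_size.
        yield batch

batches = list(pipeline(list(range(6)), num_epochs=2, batch_size=4, buffer_size=3))
print(batches)
```

In this sketch the first six records emitted are always a permutation of the whole dataset, which is the property that calling repeat after shuffle is meant to preserve. Batches may still straddle the boundary between two epochs.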

4. Defining a wrapper for the input function

Finally, since the input function passed to classifier.train cannot take any arguments, we have to wrap our input function into a new function that does not take any arguments:

import functools
train_inpf = functools.partial(input_fn, train_file, num_epochs=1, shuffle=True, batch_size=100)
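For readers unfamiliar with functools.partial, the sketch below shows what the wrapper does, using a stand-in input_fn that simply returns its arguments instead of building a dataset. The stand-in and the lambda variant are illustrative only.

```python
import functools

def input_fn(data_file, num_epochs, shuffle, batch_size, buffer_size=1000):
    # Stand-in that only reports its arguments instead of building a dataset.
    return (data_file, num_epochs, shuffle, batch_size, buffer_size)

# partial freezes the arguments and returns a callable that needs none.
train_inpf = functools.partial(input_fn, "2006.csv", num_epochs=1,
                               shuffle=True, batch_size=100)

# The estimator would call the wrapped function without arguments, like this:
print(train_inpf())

# An equivalent wrapper written as a lambda.
train_inpf_lambda = lambda: input_fn("2006.csv", num_epochs=1,
                                     shuffle=True, batch_size=100)
print(train_inpf_lambda())
```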

Testing and prediction

The wrappers for the input functions of the evaluation and prediction steps can be defined in the same way, where predict_file is the name (or pattern) of the file(s) held out for evaluation and prediction:

eval_inpf = functools.partial(input_fn, predict_file, num_epochs=1, shuffle=False, batch_size=100)
predict_inpf = functools.partial(input_fn, predict_file, num_epochs=1, shuffle=False, batch_size=100)

Just like training, all you need for the evaluation of an Estimator is encapsulated in one function:

result = classifier.evaluate(eval_inpf)

The output of the evaluation is a dictionary of metrics that can be displayed as follows:

for key,value in sorted(result.items()):
  print('%s: %s' % (key, value))

Similarly for prediction:

pred_results = classifier.predict(input_fn=predict_inpf)
for i in range(10):
    print(next(pred_results))
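Each prediction is returned as a dictionary of outputs. For a binary classifier, the 'logistic' entry (the same key used by the custom metric later in this tutorial) is, to our understanding, the sigmoid of the linear predictor, i.e. the estimated probability that the flight is delayed. The sketch below reproduces that relationship in plain Python; the function names and the 0.5 threshold are assumptions made for illustration.

```python
import math

def logistic_from_logit(logit):
    """Sigmoid function: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

def summarise(logit, threshold=0.5):
    # Hypothetical helper: turn a linear predictor into a readable summary.
    p = logistic_from_logit(logit)
    return {'logit': logit,
            'probability_delayed': p,
            'predicted_delayed': p > threshold}

# Made-up values of the linear predictor.
for logit in [-2.0, 0.0, 1.5]:
    print(summarise(logit))
```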

Before moving to the next section, note that training the LinearClassifier prints the following warning (as of 11 September 2018):

WARNING:tensorflow:Trapezoidal rule is known to produce incorrect PR-AUCs; please switch to "careful_interpolation" instead.

To avoid this problem, you can define an alternative function to calculate the area under the curve:

def metric_auc(labels, predictions):
    return {
        'auc_precision_recall': tf.metrics.auc(
            labels=labels, predictions=predictions['logistic'], num_thresholds=200,
            curve='PR', summation_method='careful_interpolation')
    }

and add it to your classifier:

classifier = tf.contrib.estimator.add_metrics(classifier, metric_auc)

Since the new metric has the same name as the existing one, the latter will be overwritten.

Retrieving the regression coefficients

Finally, if you want to retrieve the regression coefficients, you can use the following function, which returns the weight names and the corresponding coefficients, concatenated into a single flat array:

import numpy as np

def get_flat_weights(model):
   # Keep the model weights only, excluding the optimizer (Ftrl) variables
   weight_names = [
       name for name in model.get_variable_names()
       if "linear_model" in name and "Ftrl" not in name]
   # Print every variable of the model, for inspection
   for name in model.get_variable_names():
       print(name)
       print(model.get_variable_value(name))
   weight_values = [model.get_variable_value(name) for name in weight_names]
   # Flatten each weight array and join them, in the order of weight_names
   weights_flat = np.concatenate([item.flatten() for item in weight_values], axis=0)
   return weight_names, weights_flat
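Since the flat array no longer records where one variable ends and the next begins, it can help to compute the position of each block of coefficients. The helper below is a sketch written for this purpose; coefficient_slices and the example variable names and shapes are hypothetical. In practice the shapes would be read from model.get_variable_value(name).shape, in the same order as weight_names.

```python
import math

def coefficient_slices(shapes):
    """Return {name: (start, stop)} positions in the flattened vector.

    shapes is a list of (name, shape) pairs in the order in which the
    arrays were concatenated.
    """
    slices = {}
    start = 0
    for name, shape in shapes:
        size = math.prod(shape)  # number of coefficients in this variable
        slices[name] = (start, start + size)
        start += size
    return slices

# Hypothetical variable names and shapes, for illustration only.
example_shapes = [
    ("linear/linear_model/DepTime/weights", (1, 1)),
    ("linear/linear_model/Month/weights", (12, 1)),
    ("linear/linear_model/bias_weights", (1,)),
]

for name, (start, stop) in coefficient_slices(example_shapes).items():
    print('{}: weights_flat[{}:{}]'.format(name, start, stop))
```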

The full code of this tutorial can be found at https://github.com/acabassi/large-scale-regression/blob/master/tensorflow-linear-classifier.py. The version of the code that can be run in parallel can be found at https://github.com/acabassi/large-scale-regression/blob/master/tensorflow-linear-classifier-parallel.py.

Linear regression

If instead we wanted to predict the flight delay in minutes, we could proceed exactly as above, replacing LinearClassifier with LinearRegressor and not binarising the ArrDelay variable in the input function.

The full code to perform linear regression on the airline data can be found at https://github.com/acabassi/large-scale-regression/blob/master/tensorflow-linear-regressor.py. The version of the code that can be run in parallel can be found at https://github.com/acabassi/large-scale-regression/blob/master/tensorflow-linear-regressor-parallel.py.