
High performance, large-scale regression

by Alessandra Cabassi and Junyang Wang


This website serves as a blog post summarising the work conducted in the High performance, large-scale regression project, part of the Alan Turing Institute’s Summer Internship programme in 2018, sponsored by Cray Inc. The aim of the project was ‘to investigate distributed, scalable approaches to the standard statistical task of high-dimensional regression with very large amounts of data, with the ultimate goal of informing current best practice in terms of algorithms, architectures and implementations.’ The Cray Urika-GX supercomputer provided the computing power required to implement and run regression algorithms on very large datasets. An airplane dataset documenting flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008, consisting of over 120 million rows of data, was used as a case study.

Summaries of the work conducted can be found on the following pages:

Regression Benchmarking with Spark

Airplane data Introduction

Airplane data Logistic Regression with Apache Spark

Airplane data Logistic Regression with Tensorflow

Other Tensorflow tips

The project was conducted by Alessandra Cabassi (PhD student, University of Cambridge) and Junyang Wang (PhD student, University of Newcastle) under the supervision of:

Anthony Lee (Senior Lecturer, The Alan Turing Institute, University of Bristol)

Ioannis Kosmidis (Reader, The Alan Turing Institute, University of Warwick)

Rajen Shah (Lecturer, Turing Fellow, The Alan Turing Institute, University of Cambridge)

Yi Yu (Lecturer, University of Bristol)