# Airline data

In the following, we analyse the airline dataset, publicly available for download from http://stat-computing.org/dataexpo/2009/. The data contains records of all commercial flights within the USA, from October 1987 to April 2008. It can be downloaded as 22 separate csv files, each containing the data for one year. When unzipped, the files take up 12 GB.

Each column in the csv files corresponds to one covariate. Among others:

- `Year`: between 1987 and 2008
- `Month`, `DayOfMonth`, `DayOfWeek`: expressed as integers (for the days of the week, 1 is Monday)
- `CRSDepTime` and `CRSArrTime`: the expected departure and arrival local times, in the hhmm format
- `UniqueCarrier`: the unique carrier code
- `FlightNum`: the flight number
- `TailNum`: the plane tail number
- `CRSElapsedTime`: expected flight time, in minutes
- `ArrDelay`: arrival delay, in minutes
- `DepDelay`: departure delay, in minutes
- `Origin`: origin IATA airport code
- `Dest`: destination IATA airport code
- `Distance`: in miles
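As an illustration of how these columns might be read, here is a minimal sketch using only the Python standard library. The sample rows are made up, the helper names `hhmm_to_minutes` and `parse_row` are hypothetical, and the code assumes missing values appear as `NA`; it is not the loading code used in the analysis.

```python
import csv
import io

# Made-up sample with a subset of the columns listed above.
SAMPLE = """Year,Month,DayOfMonth,DayOfWeek,CRSDepTime,CRSArrTime,UniqueCarrier,ArrDelay,Distance
2008,1,3,4,1955,2225,WN,-14,810
2008,1,3,4,735,1000,WN,NA,515
"""

def hhmm_to_minutes(value):
    """Convert an hhmm string such as '735' to minutes after midnight."""
    t = int(value)
    return (t // 100) * 60 + t % 100

def parse_row(row):
    """Turn one csv row (a dict of strings) into typed fields."""
    def to_int(s):
        # Assumption: missing values are encoded as 'NA' or an empty string.
        return None if s in ("", "NA") else int(s)
    return {
        "DayOfWeek": to_int(row["DayOfWeek"]),
        "CRSDepMinutes": hhmm_to_minutes(row["CRSDepTime"]),
        "CRSArrMinutes": hhmm_to_minutes(row["CRSArrTime"]),
        "UniqueCarrier": row["UniqueCarrier"],
        "ArrDelay": to_int(row["ArrDelay"]),
        "Distance": to_int(row["Distance"]),
    }

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```

Converting hhmm values to minutes after midnight gives a numeric feature with even spacing, which the raw hhmm integers do not have (for example, 759 and 800 are one minute apart).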

Using this information, we want to see if it is possible to predict whether a flight will be delayed or not, making use of the information available before the departure. Therefore, in what follows, we binarise the `ArrDelay` column, setting each value to True if the `ArrDelay` is greater than zero, and False otherwise. Using this variable as our response, we perform logistic regression on the other covariates. The goal is to be able to do out-of-sample prediction and identify which variables influence delays the most.
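The binarisation and the model can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the data is made up, the single feature is hypothetical, and the fit uses simple batch gradient descent rather than the Spark or TensorFlow implementations used in the analysis. The function names `binarise`, `fit_logistic`, and `predict_proba` are introduced here for illustration.

```python
import math

def binarise(arr_delay):
    """True if the flight arrived late (ArrDelay > 0), False otherwise."""
    return arr_delay > 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi  # gradient of the log-loss w.r.t. the linear score
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict_proba(w, x):
    """Predicted probability that a flight with features x is delayed."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Made-up toy data: one scaled feature per flight, ArrDelay in minutes.
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
arr_delay = [-5, 0, -12, 7, 30, 3]
y = [1.0 if binarise(d) else 0.0 for d in arr_delay]
w = fit_logistic(X, y)
```

Note that a flight with `ArrDelay` equal to zero is labelled False, since the rule is strictly "greater than zero". Out-of-sample prediction then amounts to calling `predict_proba(w, x)` on flights that were not used for fitting.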

We ran logistic regression on the airline dataset using both Spark and TensorFlow; a summary of the analysis can be found: