Middleware 2018 - BigData

This is the Jupyter Notebook for the Big Data project of the Middleware course A.Y. 2017-2018.
The project has been developed by:

  • Lorenzo Petrangeli
  • Tommaso Sardelli
  • Philippe Scorsolini
In [53]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import holoviews as hv
import geoviews as gv
import cartopy

from cartopy import crs as ccrs

from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource

# Flights map
from matplotlib.colors import Normalize, LinearSegmentedColormap, PowerNorm

# Style
sns.set(rc={'figure.figsize':(15,5)})
sns.set_style("whitegrid")

hv.extension('bokeh')

Query 1 - Percentage of canceled flights per day

In [54]:
# Load data from csv
canceled_df = pd.read_csv('./spark/results/query1.csv/part-00000-d0ccdb07-2e51-4032-a5ec-3112928baee8-c000.csv', parse_dates=False)

# Build datetimes from Year, Month, Day columns
canceled_df.rename(columns={'DayofMonth': 'Day'}, inplace=True)
dates = pd.to_datetime(canceled_df[['Year', 'Month', 'Day']])

# Add Date to the original DataFrame
canceled_df = canceled_df.assign(Date=dates)

# Set Date as the new index column
canceled_df.set_index('Date', inplace=True)

Percentage of canceled flights per day

In the first query we analyse the percentage of canceled flights per day. Initially we propose a line plot of the raw data, without any kind of aggregation. In this way we can notice whether there are seasonal patterns worth investigating further, or spikes on specific days (e.g. 09/11).
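The daily percentage itself is computed by the Spark job that produced `query1.csv`; as a rough illustration of the same aggregation in pandas (on synthetic data, with hypothetical column names `Year`/`Month`/`DayofMonth`/`Cancelled` mirroring the raw dataset), it could look like:

```python
import pandas as pd

# Synthetic flights table; the values are made up, the column names
# are an assumption about the raw dataset's schema
flights = pd.DataFrame({
    'Year':       [2001, 2001, 2001, 2001],
    'Month':      [9, 9, 9, 9],
    'DayofMonth': [11, 11, 12, 12],
    'Cancelled':  [1, 1, 0, 1],  # 1 = canceled, 0 = flown
})

# The mean of the 0/1 flag per day is the cancellation rate; scale to percent
daily = (flights.groupby(['Year', 'Month', 'DayofMonth'])['Cancelled']
                .mean()
                .mul(100)
                .rename('percentageCancelled')
                .reset_index())
```

The resulting frame has one row per day, matching the shape of `canceled_df` loaded above.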

In [55]:
# Create the plot figure
p = figure(plot_width=900, plot_height=400,
           title="Percentage of canceled flights per day", x_axis_type='datetime')

# Load the Dataframe as Bokeh DataSource
source = ColumnDataSource(canceled_df)

# Add a line plot
p.line('Date', 'percentageCancelled', source=source)
p.y_range.start = 0

# Labels
p.xaxis.axis_label = "Year"
p.yaxis.axis_label = "Percentage of canceled flights"

show(p)

Percentage of canceled flights aggregated per year

Here we show three aggregated views of canceled flights:

  1. Percentage of canceled flights aggregated per year
  2. Percentage of canceled flights aggregated per month for a specific year. In this case we choose 2001, expecting it would show peculiar results
  3. Same as above, but for 2002, to see the differences with respect to the previous year

In [56]:
# Drop columns we don't need
yearly_df = canceled_df.drop(columns=['Month', 'Day'])

# Create the plot
ax = sns.barplot(x="Year", y="percentageCancelled", data=yearly_df);

# Labels
ax.set_ylabel("Percentage of canceled flights")
ax.set_title("Percentage of canceled flights aggregated per year")

plt.show()

Percentage of canceled flights aggregated per month (2001)

In [57]:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Select a single year
year_df = canceled_df.loc[canceled_df['Year'] == 2001]

# Drop columns we don't need
year_df = year_df.drop(columns=['Year', 'Day'])

# Create the plot
ax = sns.barplot(x="Month", y="percentageCancelled", data=year_df);

# Labels
ax.set_xticklabels(months)
ax.set_ylabel("Percentage of canceled flights")
ax.set_title("Percentage of canceled flights aggregated per month (2001)")

plt.show()

Percentage of canceled flights aggregated per month (2002)

In [58]:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Select a single year
year_df = canceled_df.loc[canceled_df['Year'] == 2002]

# Drop columns we don't need
year_df = year_df.drop(columns=['Year', 'Day'])

# Create the plot
ax = sns.barplot(x="Month", y="percentageCancelled", data=year_df);

# Labels
ax.set_xticklabels(months)
ax.set_ylabel("Percentage of canceled flights")
ax.set_title("Percentage of canceled flights aggregated per month (2002)")

plt.show()

Percentage of canceled flights aggregated per month (1994/1995/1996)

In the last two plots we highlight two different kinds of analysis. In the first we compare three consecutive years, aggregating the percentages of canceled flights per month, to spot any seasonal pattern. We noticed that in all years the winter months, especially January and February, have a higher rate of canceled flights. In the second plot we analyse the differences in the percentage of canceled flights in the two years before and after 09/11/2001.
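The grouped barplot below lets seaborn do the monthly averaging; an equivalent way to inspect the same seasonal comparison numerically is a year-per-column pivot. A sketch on made-up numbers, with the same column names as `canceled_df`:

```python
import pandas as pd

# Synthetic monthly percentages for two years (same columns as canceled_df)
df = pd.DataFrame({
    'Year':  [1994, 1994, 1995, 1995],
    'Month': [1, 2, 1, 2],
    'percentageCancelled': [4.0, 3.5, 4.2, 3.0],
})

# One column per year makes the month-by-month comparison explicit
seasonal = df.pivot_table(index='Month', columns='Year',
                          values='percentageCancelled', aggfunc='mean')
```

Each row of `seasonal` is a month, so winter spikes line up across the year columns at a glance.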

In [59]:
# Vars
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
colors = ["#3498db", "#e74c3c", "#2ecc71"]

# Select a single year
year_df = canceled_df.loc[canceled_df.Year.isin([1994, 1995, 1996])]

# Drop columns we don't need
year_df = year_df.drop(columns=['Day'])


# Pointplot
pt = sns.pointplot(x="Month", y="percentageCancelled", data=year_df,
              ci=None, markers="", color="#444444");

# Barplot
ax = sns.barplot(x="Month", y="percentageCancelled", hue="Year",
                 data=year_df, ci=None, palette=sns.color_palette(colors));

# Barplot Labels
ax.set_xticklabels(months)
ax.set_ylabel("Percentage of canceled flights")
ax.set_title("Percentage of canceled flights aggregated per month (1994/1995/1996)")

plt.show()

Percentage of canceled flights aggregated per month (2000/2002) before and after 09/11

In [60]:
# Vars
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
colors = ["#3498db", "#2ecc71"]

# Select a single year
year_df = canceled_df.loc[canceled_df.Year.isin([2000, 2002])]

# Drop columns we don't need
year_df = year_df.drop(columns=['Day'])

# Pointplot
sns.pointplot(x="Month", y="percentageCancelled", data=year_df,
              ci=None, color="#444444", markers="");

# Barplot
ax = sns.barplot(x="Month", y="percentageCancelled", hue="Year",
                 data=year_df, ci=None, palette=sns.color_palette(colors));

# Barplot Labels
ax.set_xticklabels(months)
ax.set_ylabel("Percentage of canceled flights")
ax.set_title("Percentage of canceled flights before and after 2001")

plt.show()

Query 2 - Weekly percentages of delays that are due to weather

Percentage of flights delayed due to weather, aggregated per year

The second query shows the percentage of flights delayed due to bad weather conditions. This kind of information is present in the dataset only starting from 2003. We first show the data in aggregated form, with percentages displayed per year. A second, interactive plot shows the weekly percentages for all the years.
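The weekly percentage is produced by the Spark job behind `query2.csv`; a minimal pandas sketch of the same idea, on synthetic per-flight delay minutes (`WeatherDelay`/`TotalDelay` are hypothetical column names, not necessarily those used by the job), could be:

```python
import pandas as pd

# Synthetic per-flight delay minutes; column names are assumptions
delays = pd.DataFrame({
    'Year':         [2003, 2003, 2003, 2003],
    'Week':         [1, 1, 2, 2],
    'WeatherDelay': [30, 0, 15, 0],
    'TotalDelay':   [30, 20, 15, 45],
})

# Share of delay minutes attributable to weather, per (Year, Week)
weekly = delays.groupby(['Year', 'Week'])[['WeatherDelay', 'TotalDelay']].sum()
weekly['percentage'] = 100 * weekly['WeatherDelay'] / weekly['TotalDelay']
```

Summing before dividing weights each week by its total delay minutes rather than averaging per-flight ratios.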

In [61]:
# Load data from csv
weather_canceled_df = pd.read_csv('./spark/results/query2.csv/part-00000-8d3d0244-1725-42b2-8331-8b4eaff2cfa0-c000.csv',
                                  parse_dates=False, dtype={'Week': 'int'})

# Drop columns we don't need
weather_year_df = weather_canceled_df.drop(columns=['Week'])

# Create the plot
ax = sns.barplot(x="Year", y="percentage", data=weather_year_df);

# Labels
ax.set_ylabel("Percentage of delayed flights")
ax.set_title("Percentage of flights delayed due to weather, aggregated per year")

plt.show()

Percentage of flights delayed due to weather, aggregated per week for one year

In [62]:
from bokeh.models.tools import HoverTool
In [63]:
# Create the Holoviews Dataset starting from the DataFrame
weather_canceled_df['percentage'] = weather_canceled_df['percentage'].astype(np.float64)
vdims = [('percentage', 'Percentage of delayed flights')]
ds = hv.Dataset(weather_canceled_df, ['Week', 'Year'], vdims)

# Create the Hover Tool
hover = HoverTool(tooltips=[("%", "@percentage{0.000}"), ("Week", "@Week")], mode='mouse')

# Colors
grey = "#95a5a6"
orange= "#e74c3c"
In [64]:
%%opts Curve [width=700 height=400 show_grid=True title_format="Percentage of flights delayed due to weather"]
%%opts Curve (color=grey line_width=1)
%%opts Scatter (color=grey size=5)
%%opts Scatter.Hover [tools=[hover]]
%%opts Scatter.Hover (size=20 color=orange alpha=0 hover_alpha=0.5)

# Create Curve and Scatter plots
curve = ds.to(hv.Curve, 'Week', 'percentage')
scatter = ds.to(hv.Scatter, 'Week', 'percentage', group='Normal')
scatter_hover = ds.to(hv.Scatter, 'Week', 'percentage', group='Hover')

# Merge and display
curve * scatter * scatter_hover
Out[64]:

Query 3 - Distance groups

Percentage of flights belonging to a given "distance group" that were able to halve their departure delays

In the third query we show the percentage of flights that were able to halve their departure delays. Flights are split into distance groups that represent ranges of flight lengths. The first plot shows the aggregated percentages for each distance group. The second plot is interactive and shows the percentages with respect to each airport, with airports geolocated on the map.
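The per-group percentage comes from the Spark job behind `query3.csv`; as an illustration only, assuming "halving the departure delay" means the arrival delay is at most half the departure delay (an assumption about the metric), the aggregation could be sketched in pandas as:

```python
import pandas as pd

# Synthetic flights; column names and the halving criterion are assumptions
flights = pd.DataFrame({
    'DistanceGroup': [1, 1, 2, 2],
    'DepDelay':      [40, 40, 60, 60],
    'ArrDelay':      [10, 35, 20, 25],
})

# Flag flights whose arrival delay is at most half the departure delay,
# then take the per-group mean of the flag as a percentage
halved = flights['ArrDelay'] <= flights['DepDelay'] / 2
pct = (flights.assign(halved=halved)
              .groupby('DistanceGroup')['halved']
              .mean() * 100)
```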

In [65]:
# Load data from csv
distgroup_df = pd.read_csv('./spark/results/query3.csv/part-00000-ef959f33-da8c-4e1f-94f8-c7b7ff70381f-c000.csv', parse_dates=False)

# Create the plot
ax = sns.barplot(x="DistanceGroup", y="percentage", data=distgroup_df);

# Labels
ax.set_ylabel("Percentage of flights")
ax.set_xlabel("Distance Groups")
ax.set_title("Percentage of flights belonging to a distance group that halved their departure delays")

plt.show()

Distribution of flights in airports for each distance group

In [66]:
# Load airport coordinates
coords = pd.read_csv('./spark/results/query4.csv/res_with_lat_lon.csv/part-00000-7e572ce9-53fb-4a1e-9783-a43c3fb48685-c000.csv')
coords = coords[['Airport', 'lat', 'lon']]
coords.drop_duplicates(inplace=True)

# Load airport data
airports = pd.read_csv('./spark/results/query3.csv/extra/part-00000-3333ac90-17af-43e6-9822-dc993249c8b6-c000.csv')

# Merge DataFrames
traffic = pd.merge(airports, coords, left_on='Origin', right_on='Airport').drop(columns=['Origin'])

# Create the Dataset
traffic_ds = gv.Dataset(traffic, kdims=['DistanceGroup'])

# Load map tiles
wikipedia_tile = 'https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png'

# Create the Hover Tool
hover = HoverTool(tooltips=[("Number of flights", "@Total{0,0}"), ("Airport", "@Airport")], mode='mouse')
In [67]:
%%opts Overlay [width=700 height=700 title_format="Distribution of flights for each distance group" xaxis=None yaxis=None] 
%%opts Points (size=0.02 cmap='viridis') [tools=[hover] size_index=2 color_index=2]


# Create the plot
(gv.WMTS(wikipedia_tile) *\
traffic_ds.to(gv.Points, kdims=['lon', 'lat'],
              vdims=['Total', 'Airport'], crs=ccrs.PlateCarree()))
Out[67]:

Query 4 - Weekly Penalty

Weekly delay penalties for each airport

The fourth query shows the weekly penalties assigned to airports for delayed flights. The plot is similar to the one proposed for the third query: airports are geolocated on the map, and years/weeks can be chosen interactively.

In [68]:
# Load data from csv
penalty_df = pd.read_csv('./spark/results/query4.csv/res_with_lat_lon.csv/part-00000-7e572ce9-53fb-4a1e-9783-a43c3fb48685-c000.csv', parse_dates=False)

# Create the Dataset
penalty_ds = gv.Dataset(penalty_df, kdims=['Year', 'Week'])

# Create the Hover Tool
hover = HoverTool(tooltips=[("Penalty", "@Penalty{0,0}"), ("Airport", "@Airport")], mode='mouse')
In [69]:
%%output max_frames=1000
%%opts Overlay [width=700 height=700 title_format="Weekly delay penalties" xaxis=None yaxis=None] 
%%opts Points (size=0.5 cmap='viridis') [tools=[hover] size_index=2 color_index=2]

(gv.WMTS(wikipedia_tile) *\
penalty_ds.to(gv.Points, kdims=['lon', 'lat'],
              vdims=['Penalty', 'Airport'], crs=ccrs.PlateCarree()))
Out[69]:

Query 5 (extra) - Busiest Airports and Flight Traffic

Busiest Airports

In this extra query we show two different kinds of information. In the first plot we extracted the twenty busiest airports and displayed them in a chord plot. Each airport can be selected interactively to highlight the most frequent destinations for that airport.

In [70]:
# Load data from csv
from_to_df = pd.read_csv('spark/results/extraQuery/from_to/part-00000-c6bf2e5e-f476-4ecf-81c5-383b25e030ff-c000.csv', parse_dates=False)
airports = pd.read_csv('spark/results/airports.csv')

route_counts = from_to_df.groupby(['Origin', 'Dest']).sum().sort_values('count(1)', ascending=False).drop(columns='Year').reset_index()

#  Get Origin and Destination IDs from IATA codes
origins = airports[['IATA', 'AirportID']].rename(columns={'AirportID': 'OriginID'})
dests = airports[['IATA', 'AirportID']].rename(columns={'AirportID': 'DestID'})

# Create the new DataFrame complete with IDs
airports_id = pd.merge(route_counts, origins, left_on='Origin', right_on='IATA').drop(columns='IATA')
airports_id = pd.merge(airports_id, dests, left_on='Dest', right_on='IATA').drop(columns='IATA')

# Create the chord elements
nodes = hv.Dataset(airports, 'AirportID', ['City', 'IATA'])
chord = hv.Chord((airports_id, nodes), ['OriginID', 'DestID'], ['count(1)'])

# Select the 20 busiest airports
busiest = list(airports_id.groupby('OriginID')['count(1)'].sum().sort_values(ascending=False).iloc[:20].index.values)
busiest_airports = chord.select(OriginID=busiest, selection_mode='nodes')
In [71]:
%%opts Chord [edge_color_index='OriginID' label_index='City' color_index='AirportID' width=800 height=800]
%%opts Chord (cmap='Category20' edge_cmap='Category20')
busiest_airports
Out[71]:

Flight Traffic

In the second plot we displayed 1500 flights on the map in order to give a visual clue of the flight traffic across the US, both in terms of quantity and spatial distribution.

In [72]:
# Load data from csv
from_to_df = pd.read_csv('spark/results/extraQuery/from_to/part-00000-c6bf2e5e-f476-4ecf-81c5-383b25e030ff-c000.csv', parse_dates=False)
airports = pd.read_csv('spark/results/airports.csv')

airports_coords = airports[['Latitude', 'Longitude', 'IATA']]

routes = pd.merge(airports_id, airports_coords, left_on='Origin', right_on='IATA').drop(columns='IATA')
routes.rename(columns={'Latitude': 'dep_lat', 'Longitude': 'dep_lon'}, inplace=True)

routes = pd.merge(routes, airports_coords, left_on='Dest', right_on='IATA').drop(columns='IATA')
routes.rename(columns={'Latitude': 'arr_lat', 'Longitude': 'arr_lon'}, inplace=True)

routes['count(1)'] = (routes['count(1)'] // 100) + 1

routes = routes.sort_values('count(1)', ascending=False).iloc[:1500]
In [73]:
import cartopy.feature as cpf

absolute = True
color_mode = 'screen'

if color_mode == 'screen':
    bg_color = (0.0, 0.0, 0, 1.0)
    coast_color = (204/255.0, 0, 153/255.0, 0.7)
## Purple
    color_list = [(0.0, 0.0, 0.0, 0.0),
                  (204/255.0, 0, 153/255.0, 0.6),
                  (255/255.0, 204/255.0, 230/255.0, 1.0)]

else:
    bg_color = (1.0, 1.0, 1.0, 1.0)
    coast_color = (10.0/255.0, 10.0/255.0, 10/255.0, 0.8)
    color_list = [(1.0, 1.0, 1.0, 0.0),
                  (255/255.0, 204/255.0, 230/255.0, 1.0),
                  (204/255.0, 0, 153/255.0, 0.2)
                  ]


fig = plt.figure(figsize=(30, 20))
    
ax = plt.axes(projection=ccrs.Miller())

ax.add_feature(cpf.COASTLINE)
ax.add_feature(cpf.LAND, color=bg_color)
ax.add_feature(cpf.OCEAN, color='grey')
ax.add_feature(cpf.BORDERS)
ax.add_feature(cpf.STATES)

num_routes = routes['count(1)'].sum()
#print(num_routes)

# normalize the dataset for color scale
norm = PowerNorm(0.3, routes['count(1)'].min(),routes['count(1)'].max())

# create a linear color scale with enough colors
if absolute:
    n = routes['count(1)'].max()
else:
    n = num_routes
cmap = LinearSegmentedColormap.from_list('cmap_flights', color_list, N=n)
    
# plot each route with its color depending on the number of flights
for i, route in enumerate(routes.sort_values(by='count(1)', ascending=True).iterrows()):
    route = route[1]
    if absolute:
        color = cmap(norm(int(route['count(1)'])))
    else:
        color = cmap(i * 1.0 / num_routes)
                
    plt.plot([route['dep_lon'], route['arr_lon']],
             [route['dep_lat'], route['arr_lat']],
             linewidth=1.5, color=color,  transform=ccrs.Geodetic())

plt.show()

Query 6 (extra) - Taxi Time

In the sixth and final query we show, for each airport, the total time (split by year) spent by flights taxiing in and out. This is the time a plane spends moving on the ground before it is allowed to take off (taxi out) or right after landing (taxi in).
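The per-airport totals come from the Spark job behind the `extraQuery/taxi` output; the equivalent pandas aggregation, sketched on synthetic per-flight taxi minutes (hypothetical column names), would be:

```python
import pandas as pd

# Synthetic per-flight taxi minutes; column names are assumptions
flights = pd.DataFrame({
    'Year':    [2005, 2005, 2005],
    'Origin':  ['JFK', 'JFK', 'LAX'],
    'TaxiIn':  [5, 7, 4],
    'TaxiOut': [20, 18, 15],
})

# Total taxi-in / taxi-out minutes per airport and year
taxi_totals = (flights.groupby(['Year', 'Origin'])[['TaxiIn', 'TaxiOut']]
                      .sum()
                      .reset_index())
```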

In [74]:
# Load data from csv
taxi_time = pd.read_csv('spark/results/extraQuery/taxi/part-00000-55104509-131e-49e4-b6da-e315a5c91410-c000.csv')

# Merge DataFrames
taxi_time_geo = pd.merge(taxi_time, coords, left_on='Origin', right_on='Airport').drop(columns=['Origin'])

# Create the Dataset
taxi_time_ds = gv.Dataset(taxi_time_geo, kdims=['Year'])

# Create two separate hover tools for taxi_in and taxi_out
hover_taxi_in = HoverTool(
    tooltips=[
        ("TaxiIn", "@TaxiIn{0,0}"),
        ("Airport", "@Airport")],
    mode='mouse')

hover_taxi_out = HoverTool(
    tooltips=[
        ("TaxiOut", "@TaxiOut{0,0}"),
        ("Airport", "@Airport")],
    mode='mouse')
In [75]:
%%output max_frames=1000
%%opts Overlay [width=700 height=700 title_format="Taxi Time" xaxis=None yaxis=None] 
%%opts Points.In (size=0.02 cmap='viridis') [tools=[hover_taxi_in] size_index=2 color_index=2]
%%opts Points.Out (size=0.02 cmap='viridis') [tools=[hover_taxi_out] size_index=2 color_index=2]

# Map
maptiles = gv.WMTS(wikipedia_tile)

# TaxiIn points chart
taxi_in = taxi_time_ds.to(gv.Points, kdims=['lon', 'lat'], vdims=['TaxiIn', 'Airport'],
                          crs=ccrs.PlateCarree(), label='Taxi In', group='In')
# TaxiOut points chart
taxi_out = taxi_time_ds.to(gv.Points, kdims=['lon', 'lat'], vdims=['TaxiOut', 'Airport'],
                           crs=ccrs.PlateCarree(), label='Taxi Out', group='Out')

# Combine map and point plots
maptiles * taxi_out * taxi_in
Out[75]: