Data Analysis and Visualization using Python



Akkal Bahadu Bist akkalbist55@gmail.com Sikhar Tech Pvt. Ltd. (Data Scientist/AI Researcher)

Dependencies

Install Python

$ sudo apt-get install python3.5

Install Anaconda

Create an environment “ai” with all must-have libraries.

$ conda create -n ai python=3.5
$ source activate ai
$ conda install pandas matplotlib jupyter notebook scipy scikit-learn nltk

Run jupyter notebook

$ jupyter notebook
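Before opening the notebook, it can help to confirm that the libraries imported later in this post are available in the active environment. The snippet below is only a minimal sketch; the helper name `check_packages` is hypothetical and not part of the original setup.

```python
import importlib

def check_packages(names):
    """Map each module name to its version string, or None if it cannot be imported."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Not every module exposes __version__, so fall back to "unknown".
            found[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            found[name] = None
    return found

# Modules imported later in this post
wanted = ["numpy", "pandas", "seaborn", "matplotlib", "wordcloud", "plotly"]
for name, version in check_packages(wanted).items():
    print(name, version if version is not None else "NOT INSTALLED")
```

Any module reported as not installed can be added to the "ai" environment before continuing.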

FIFA 2018 Worldcup Player Dataset

17k+ players and 70+ attributes extracted from the latest edition of FIFA. Source: https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset (extended version).
Dataset includes:

  • Player personal attributes (Nationality, Club, Photo, Age, Value etc.)
  • Player performance attributes (Overall, Potential, Aggression, Agility etc.)
  • Player preferred position and ratings at all positions.
Attributes:
  • Name: name of a player
  • Age: age of a player
  • Photo: picture of player
  • Nationality: player nationality
  • Flag
  • Overall
  • Potential
  • Club: The international club for which the player plays
  • Club Logo
  • Value
  • Wage
  • Special
  • Acceleration
  • Aggression
  • Agility
  • Balance
  • Ball control
  • Composure
  • Crossing
  • Curve
  • Dribbling
  • Finishing
  • Free kick accuracy
  • GK diving
  • GK handling
  • GK kicking
  • GK positioning
  • GK reflexes
  • Heading accuracy
  • Interceptions
  • Jumping
  • Long passing
  • Long shots
  • Marking
  • Penalties
  • Positioning
  • Reactions
  • Short passing
  • Shot power
  • Sliding tackle
  • Sprint speed
  • Stamina
  • Standing tackle
  • Strength
  • Vision
  • Volleys
  • CAM: Center Attacking Midfielder
  • CB: Center Back
  • CDM: Center Defensive Midfielder
  • CF: Center Forward
  • CM: Center Midfielder
  • ID: Player's ID in FIFA18
  • LAM: Left Attacking Midfielder
  • LB: Left Back
  • LCB: Left Center Back
  • LCM: Left Center Midfielder
  • LDM: Left Defensive Midfielder
  • LF: Left Forward
  • LM: Left Midfielder
  • LS: Left Striker
  • LW: Left Wing
  • LWB: Left Wing Back
  • Preferred Positions: Player's Preferred Position
  • RAM: Right Attacking Midfielder
  • RB: Right Back
  • RCB: Right Center Back
  • RCM: Right Center Midfielder
  • RDM: Right Defensive Midfielder
  • RF: Right Forward
  • RM: Right Midfielder
  • RS: Right Striker
  • RW: Right Wing
  • RWB: Right Wing Back
  • ST: Striker

Import libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# import plotly modules
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

from IPython.display import display
%matplotlib inline

Read dataset

dataset = pd.read_csv('CompleteDataset.csv')
/home/akkal/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2785: DtypeWarning:

Columns (23,35) have mixed types. Specify dtype option on import or set low_memory=False.
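The warning above means pandas inferred different types for different chunks of columns 23 and 35. As the message suggests, this can be avoided either with `low_memory=False` or by stating the column type explicitly. The sketch below uses a tiny in-memory CSV with made-up column names and values, since the actual offending columns are not identified by name here.

```python
import io
import pandas as pd

# Hypothetical stand-in for a column that mixes plain numbers and text
csv_text = "Name,Rating\nA,85\nB,70+2\nC,64\n"

# Option 1: read the whole file before inferring column types
df_full = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: declare the type up front so no inference is needed for that column
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"Rating": str})

print(df_typed["Rating"].tolist())
```

With the real file, the same keyword arguments would be passed to `pd.read_csv('CompleteDataset.csv', ...)`.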

Quick Overview

display(dataset.head())
Unnamed: 0 Name Age Photo Nationality Flag Overall Potential Club Club Logo ... RB RCB RCM RDM RF RM RS RW RWB ST
0 0 Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal https://cdn.sofifa.org/flags/38.png 94 94 Real Madrid CF https://cdn.sofifa.org/24/18/teams/243.png ... 61.0 53.0 82.0 62.0 91.0 89.0 92.0 91.0 66.0 92.0
1 1 L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina https://cdn.sofifa.org/flags/52.png 93 93 FC Barcelona https://cdn.sofifa.org/24/18/teams/241.png ... 57.0 45.0 84.0 59.0 92.0 90.0 88.0 91.0 62.0 88.0
2 2 Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil https://cdn.sofifa.org/flags/54.png 92 94 Paris Saint-Germain https://cdn.sofifa.org/24/18/teams/73.png ... 59.0 46.0 79.0 59.0 88.0 87.0 84.0 89.0 64.0 84.0
3 3 L. Suárez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay https://cdn.sofifa.org/flags/60.png 92 92 FC Barcelona https://cdn.sofifa.org/24/18/teams/241.png ... 64.0 58.0 80.0 65.0 88.0 85.0 88.0 87.0 68.0 88.0
4 4 M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany https://cdn.sofifa.org/flags/21.png 92 92 FC Bayern Munich https://cdn.sofifa.org/24/18/teams/21.png ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 rows × 75 columns

Select only the columns of interest

int_col = [
    'Name', 
    'Age', 
    'Photo', 
    'Nationality', 
    'Overall', 
    'Potential', 
    'Club', 
    'Value', 
    'Wage', 
    'Preferred Positions'
]
dataset = pd.DataFrame(dataset, columns=int_col)
display(dataset.head())
Name Age Photo Nationality Overall Potential Club Value Wage Preferred Positions
0 Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal 94 94 Real Madrid CF €95.5M €565K ST LW
1 L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina 93 93 FC Barcelona €105M €565K RW
2 Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil 92 94 Paris Saint-Germain €123M €280K LW
3 L. Suárez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay 92 92 FC Barcelona €97M €510K ST
4 M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany 92 92 FC Bayern Munich €61M €230K GK

Preprocessing

# Drop NA values
dataset = dataset.dropna()
# Convert the Value and Wage strings (e.g. '€95.5M', '€565K') to numbers
def str2number(amount):
    # The first character is the currency symbol; the last may be an 'M' or 'K' suffix
    if amount[-1] == 'M':
        return float(amount[1:-1])*1000000
    elif amount[-1] == 'K':
        return float(amount[1:-1])*1000
    else:
        return float(amount[1:])

dataset['ValueNum'] = dataset['Value'].apply(str2number)
dataset['WageNum'] = dataset['Wage'].apply(str2number)
display(dataset.head())
Name Age Photo Nationality Overall Potential Club Value Wage Preferred Positions ValueNum WageNum
0 Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal 94 94 Real Madrid CF €95.5M €565K ST LW 95500000.0 565000.0
1 L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina 93 93 FC Barcelona €105M €565K RW 105000000.0 565000.0
2 Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil 92 94 Paris Saint-Germain €123M €280K LW 123000000.0 280000.0
3 L. Suárez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay 92 92 FC Barcelona €97M €510K ST 97000000.0 510000.0
4 M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany 92 92 FC Bayern Munich €61M €230K GK 61000000.0 230000.0
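The same conversion can also be written with pandas string methods instead of a row-by-row `apply`. This is just an alternative sketch, not the approach used in this post; the function name `amounts_to_number` is hypothetical, and it assumes every value starts with "€" and ends with "M", "K", or a digit.

```python
import pandas as pd

def amounts_to_number(series):
    """Convert strings such as '€95.5M', '€565K' or '€0' to floats."""
    # Drop the leading currency symbol
    stripped = series.str.lstrip("€")
    # Map the trailing suffix to a multiplier; values without a suffix get 1
    multiplier = stripped.str[-1].map({"M": 1e6, "K": 1e3}).fillna(1.0)
    # Remove the suffix (if any) and convert what is left to float
    digits = stripped.str.rstrip("MK").astype(float)
    return digits * multiplier

sample = pd.Series(["€95.5M", "€565K", "€0"])
print(amounts_to_number(sample).tolist())
```

For the sample above, the values correspond to the `ValueNum` and `WageNum` entries shown for Cristiano Ronaldo, plus a zero amount to exercise the no-suffix case.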
# Categorical columns of Value and Wage
max_value = float(dataset['ValueNum'].max() + 1)
max_wage = float(dataset['WageNum'].max() + 1)

print("Max value:", max_value, "Max_wage:", max_wage)

# Supporting function for creating category columns 'ValueCategory' and 'WageCategory'
def mappingAmount(x, max_amount):
    for i in range(0, 10):
        if x >= max_amount/10*i and x < max_amount/10*(i+1):
            return i
        
dataset['ValueCategory'] = dataset['ValueNum'].apply(lambda x: mappingAmount(x, max_value))
dataset['WageCategory'] = dataset['WageNum'].apply(lambda x: mappingAmount(x, max_wage))
Max value: 123000001.0 Max_wage: 565001.0
display(dataset.head())
Name Age Photo Nationality Overall Potential Club Value Wage Preferred Positions ValueNum WageNum ValueCategory WageCategory
0 Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal 94 94 Real Madrid CF €95.5M €565K ST LW 95500000.0 565000.0 7 9
1 L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina 93 93 FC Barcelona €105M €565K RW 105000000.0 565000.0 8 9
2 Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil 92 94 Paris Saint-Germain €123M €280K LW 123000000.0 280000.0 9 4
3 L. Suárez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay 92 92 FC Barcelona €97M €510K ST 97000000.0 510000.0 7 9
4 M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany 92 92 FC Bayern Munich €61M €230K GK 61000000.0 230000.0 4 4
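The category columns above split the range from 0 to the maximum into ten equal-width bins. A comparable result can be sketched with `pd.cut`; the function name `to_equal_width_bins` is hypothetical, and the sample amounts are the `ValueNum` figures from the table.

```python
import numpy as np
import pandas as pd

def to_equal_width_bins(values, max_amount, n_bins=10):
    """Assign each value to one of n_bins equal-width bins covering [0, max_amount)."""
    edges = np.linspace(0, max_amount, n_bins + 1)  # n_bins + 1 edges -> n_bins bins
    # right=False makes every bin closed on the left and open on the right: [low, high)
    return pd.cut(values, bins=edges, labels=False, right=False)

values = pd.Series([95500000.0, 105000000.0, 123000000.0, 61000000.0])
print(to_equal_width_bins(values, 123000001.0).tolist())  # → [7, 8, 9, 4]
```

Adding 1 to the maximum, as done for `max_value` and `max_wage`, is what keeps the largest amount inside the last half-open bin.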
# Add binary flags (0 or 1) indicating whether a player's value/wage is higher than the mean
mean_value = float(dataset["ValueNum"].mean())
mean_wage = float(dataset["WageNum"].mean())

# Supporting function for creating category columns 'OverMeanValue' and 'OverMeanWage'
def overValue(x, limit):
    if x > limit:
        return 1
    else:
        return 0
    
dataset['OverMeanValue'] = dataset['ValueNum'].apply(lambda x: overValue(x, mean_value))
dataset['OverMeanWage'] = dataset['WageNum'].apply(lambda x: overValue(x, mean_wage))
display(dataset.head())
Name Age Photo Nationality Overall Potential Club Value Wage Preferred Positions ValueNum WageNum ValueCategory WageCategory OverMeanValue OverMeanWage
0 Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal 94 94 Real Madrid CF €95.5M €565K ST LW 95500000.0 565000.0 7 9 1 1
1 L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina 93 93 FC Barcelona €105M €565K RW 105000000.0 565000.0 8 9 1 1
2 Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil 92 94 Paris Saint-Germain €123M €280K LW 123000000.0 280000.0 9 4 1 1
3 L. Suárez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay 92 92 FC Barcelona €97M €510K ST 97000000.0 510000.0 7 9 1 1
4 M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany 92 92 FC Bayern Munich €61M €230K GK 61000000.0 230000.0 4 4 1 1
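A flag like `OverMeanWage` can also be computed by comparing the whole column with the mean at once. The sketch below uses a small made-up wage series rather than the real dataset.

```python
import pandas as pd

# Hypothetical wages, chosen only to illustrate the comparison
wages = pd.Series([565000.0, 280000.0, 1000.0, 2000.0])

# Comparing a Series with a scalar yields booleans; astype(int) turns them into 0/1
over_mean = (wages > wages.mean()).astype(int)
print(over_mean.tolist())  # → [1, 1, 0, 0]
```

The mean of the sample is 212000, so only the first two entries are flagged.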
# Potential points
dataset['PotentialPoints'] = dataset['Potential'] - dataset['Overall']
display(dataset.head())
Name Age Photo Nationality Overall Potential Club Value Wage Preferred Positions ValueNum WageNum ValueCategory WageCategory OverMeanValue OverMeanWage PotentialPoints
0 Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal 94 94 Real Madrid CF €95.5M €565K ST LW 95500000.0 565000.0 7 9 1 1 0
1 L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina 93 93 FC Barcelona €105M €565K RW 105000000.0 565000.0 8 9 1 1 0
2 Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil 92 94 Paris Saint-Germain €123M €280K LW 123000000.0 280000.0 9 4 1 1 2
3 L. Suárez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay 92 92 FC Barcelona €97M €510K ST 97000000.0 510000.0 7 9 1 1 0
4 M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany 92 92 FC Bayern Munich €61M €230K GK 61000000.0 230000.0 4 4 1 1 0
# Preferred position
# To keep things simple, take the first position in the list as the preferred one
dataset['Position'] = dataset['Preferred Positions'].str.split().str[0]
dataset['PositionNum'] = dataset['Preferred Positions'].apply(lambda x: len(x.split()))
display(dataset.head())
Name Age Photo Nationality Overall Potential Club Value Wage Preferred Positions ValueNum WageNum ValueCategory WageCategory OverMeanValue OverMeanWage PotentialPoints Position PositionNum
0 Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal 94 94 Real Madrid CF €95.5M €565K ST LW 95500000.0 565000.0 7 9 1 1 0 ST 2
1 L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina 93 93 FC Barcelona €105M €565K RW 105000000.0 565000.0 8 9 1 1 0 RW 1
2 Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil 92 94 Paris Saint-Germain €123M €280K LW 123000000.0 280000.0 9 4 1 1 2 LW 1
3 L. Suárez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay 92 92 FC Barcelona €97M €510K ST 97000000.0 510000.0 7 9 1 1 0 ST 1
4 M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany 92 92 FC Bayern Munich €61M €230K GK 61000000.0 230000.0 4 4 1 1 0 GK 1
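Both derived columns come from the same whitespace split, so the split can be done once and reused. A small sketch on sample strings; the trailing spaces are an assumption about how the raw values might look, not something confirmed by this post.

```python
import pandas as pd

positions = pd.Series(["ST LW ", "RW ", "GK "])

# split() with no arguments splits on any whitespace and ignores trailing spaces
split = positions.str.split()
first_position = split.str[0]      # first listed position
position_count = split.str.len()   # number of listed positions

print(first_position.tolist(), position_count.tolist())
```

For these samples the first positions are ST, RW and GK, with 2, 1 and 1 listed positions respectively.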
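# Continent

# Match a country to its continent.
# NOTE: `continents` is not defined in this post; it is assumed to be a dict
# that maps each continent name to a list of its countries.
def find_continent(x, continents_list):
    # Return the first continent whose country list contains x
    for key in continents_list:
        if x in continents_list[key]:
            return key
    return np.nan
dataset['Continent'] = dataset['Nationality'].apply(lambda x: find_continent(x, continents))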
display(dataset.head())
Name Age Photo Nationality Overall Potential Club Value Wage Preferred Positions ValueNum WageNum ValueCategory WageCategory OverMeanValue OverMeanWage PotentialPoints Position PositionNum Continent
0 Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal 94 94 Real Madrid CF €95.5M €565K ST LW 95500000.0 565000.0 7 9 1 1 0 ST 2 Europe
1 L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina 93 93 FC Barcelona €105M €565K RW 105000000.0 565000.0 8 9 1 1 0 RW 1 South America
2 Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil 92 94 Paris Saint-Germain €123M €280K LW 123000000.0 280000.0 9 4 1 1 2 LW 1 South America
3 L. Suárez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay 92 92 FC Barcelona €97M €510K ST 97000000.0 510000.0 7 9 1 1 0 ST 1 South America
4 M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany 92 92 FC Bayern Munich €61M €230K GK 61000000.0 230000.0 4 4 1 1 0 GK 1 Europe
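The `continents` dictionary used above is not shown in this post. The sketch below assumes it maps a continent name to a list of countries and fills it with a deliberately incomplete, hypothetical sample covering only the five nationalities visible in the table. Inverting it into a country-to-continent lookup allows `Series.map` to do the matching.

```python
import pandas as pd

# Hypothetical and incomplete; the real dictionary would need to list every country
continents = {
    "Europe": ["Portugal", "Germany"],
    "South America": ["Argentina", "Brazil", "Uruguay"],
}

# Invert to country -> continent so each lookup is a single dictionary access
country_to_continent = {
    country: continent
    for continent, countries in continents.items()
    for country in countries
}

nationality = pd.Series(["Portugal", "Argentina", "Atlantis"])
# Countries missing from the mapping become NaN, like find_continent's fallback
mapped = nationality.map(country_to_continent)
print(mapped.tolist())
```

Checking `mapped.isna().sum()` afterwards is a quick way to spot countries that are still missing from the mapping.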

Dataset Visualization: Age/Overall/Potential

Grouping players by Nationality - D3 chart

Reference: Zoomable Circle Packing via D3.js in IPython
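The chart itself is not reproduced in this post. Circle-packing layouts in D3 generally consume a nested name/children structure, so one plausible preparation step is to group players by nationality and serialize the result as JSON. The sketch below rests on that assumption; the function name `to_hierarchy`, the `size` key, and the choice of `Overall` as circle size are all hypothetical.

```python
import json
import pandas as pd

def to_hierarchy(df, group_col, name_col, size_col, root_name="players"):
    """Build a nested dict: root -> one child per group -> one leaf per row."""
    children = []
    for group, rows in df.groupby(group_col, sort=True):
        leaves = [
            {"name": name, "size": int(size)}
            for name, size in zip(rows[name_col], rows[size_col])
        ]
        children.append({"name": group, "children": leaves})
    return {"name": root_name, "children": children}

# Rows taken from the head() output shown earlier
players = pd.DataFrame({
    "Name": ["Cristiano Ronaldo", "L. Messi", "Neymar", "M. Neuer"],
    "Nationality": ["Portugal", "Argentina", "Brazil", "Germany"],
    "Overall": [94, 93, 92, 92],
})

tree = to_hierarchy(players, "Nationality", "Name", "Overall")
print(json.dumps(tree, indent=2, ensure_ascii=False))
```

The resulting JSON string could then be handed to whatever D3 rendering code the notebook embeds.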

To be continued ...
