FIFA Soccer Players’ Value Analysis

Group Member: Haochen Chen, Maria Chen

Dataset:

Kaggle Competition Datasets

  1. FIFA Players 2019 data:https://www.kaggle.com/datasets/javagarm/fifa-19-complete-player-dataset
  2. FIFA Players 2020 data: https://www.kaggle.com/datasets/sagunsh/fifa-20-complete-player-dataset
  3. FIFA Players 2021 data:https://www.kaggle.com/datasets/umeshkumar017/fifa-21-player-and-formation-analysis

Introduction of Project:

Now people pay more and more attention to the changes in football. FIFA is the International Association Football Federation, an organization to ensure fair competition for players from all countries. FIFA released the football game of the same name, and all the data in the game come from real data. Each player in the game will have different worth, salary and skill rating. Players can choose to buy different player cards to form their own team to win the game. This project mainly analyzes the factors that change the value of players to help users get the highest value player cards at the lowest cost. We use the three year datas, cause we want to see whether or not Covide affect the players' value, and what will change accross these three years.

Simple Timeline for Project:

Our team will have a meeting on Fridays every week to discuss each other's completion, problems encountered and solutions. Maria Chen will be mainly responsible for the analysis of Player in the project, and Haochen Chen will be mainly responsible for the analysis of Team in the project. After both parties have completed the analysis of their respective databases, we will integrate the data from our analysis and finally work together to predict the future performance of players and teams.

Connect and Import connect with google drive, and import the basic funtion.

In [48]:
# Connect our notebook to the Drive
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/CMPS3160_Project
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/CMPS3160_Project
In [49]:
#Import some package we will use in the future process
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set_theme()
from scipy import stats, integrate
from collections import Counter
from sklearn.utils import shuffle

#origanl table FIFA 19-21 data; data 1 = FIFA 19 data 
data190 = pd.read_csv('/content/drive/MyDrive/CMPS3160_Project/data/1.csv')

data_2=pd.read_csv('/content/drive/MyDrive/CMPS3160_Project/data/2.csv')

data210 = pd.read_csv('/content/drive/MyDrive/CMPS3160_Project/data/3.csv')

# Save the change csv to the new csv which not include the string in the Value, Age, Wage
path1 = "/content/drive/MyDrive/CMPS3160_Project/data/19.csv"
data19 = pd.read_csv(path1, sep=",",encoding = "ISO-8859-1")

path2 = "/content/drive/MyDrive/CMPS3160_Project/data/20.csv"
data20 = pd.read_csv(path2, sep=",",encoding = "ISO-8859-1")

path3 = "/content/drive/MyDrive/CMPS3160_Project/data/21.csv"
data21 = pd.read_csv(path3, sep=",",encoding = "ISO-8859-1")
/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py:3326: DtypeWarning:

Columns (16,86) have mixed types.Specify dtype option on import or set low_memory=False.

/usr/local/lib/python3.8/dist-packages/IPython/core/interactiveshell.py:3326: DtypeWarning:

Columns (74) have mixed types.Specify dtype option on import or set low_memory=False.

Table Introduction

We have collected FIFA player data over the past three years to analyze what factors affect a player's value. The main data we need are that payers' name, age, value, wage, country, club, position and overall score. The overall score is the FIFA game scording system. There are six main criteria which appear alongside the overall score: speed, shooting, passing, defending, dribbling and physicality.

Key Factor Definition: Value: Value of a soccer player card in FIFA is an estimate of the amount for which the owner can sell the player card to another user. Or the amount for which the user is willing to pay for the player card. Wage: Wage of a soccer player is the amount of money that is regularly paid to them by the soccer club they work for.

Table1: (data19) FIFA Players 2019 data

In [50]:
data19.head()
Out[50]:
Unnamed: 0 ID Name Age Photo Nationality Flag Overall Potential Club ... Composure Marking StandingTackle SlidingTackle GKDiving GKHandling GKKicking GKPositioning GKReflexes Release Clause
0 0 158023 L. Messi 31.0 https://cdn.sofifa.org/players/4/19/158023.png Argentina https://cdn.sofifa.org/flags/52.png 94.0 94 FC Barcelona ... 96.0 33.0 28.0 26.0 6.0 11.0 15.0 14.0 8.0 226.5M
1 1 20801 Cristiano Ronaldo 33.0 https://cdn.sofifa.org/players/4/19/20801.png Portugal https://cdn.sofifa.org/flags/38.png 94.0 94 Juventus ... 95.0 28.0 31.0 23.0 7.0 11.0 15.0 14.0 11.0 127.1M
2 2 190871 Neymar Jr 26.0 https://cdn.sofifa.org/players/4/19/190871.png Brazil https://cdn.sofifa.org/flags/54.png 92.0 93 Paris Saint-Germain ... 94.0 27.0 24.0 33.0 9.0 9.0 15.0 15.0 11.0 228.1M
3 3 193080 De Gea 27.0 https://cdn.sofifa.org/players/4/19/193080.png Spain https://cdn.sofifa.org/flags/45.png 91.0 93 Manchester United ... 68.0 15.0 21.0 13.0 90.0 85.0 87.0 88.0 94.0 138.6M
4 4 192985 K. De Bruyne 27.0 https://cdn.sofifa.org/players/4/19/192985.png Belgium https://cdn.sofifa.org/flags/7.png 91.0 92 Manchester City ... 88.0 68.0 58.0 51.0 15.0 13.0 5.0 10.0 13.0 196.4M

5 rows × 89 columns

Table 2: (data20) FIFA Players 2020 data

In [51]:
data20.head()
Out[51]:
Name Image Country Position Age Overall Potential Club ID Height ... A/W D/W IR PAC SHO PAS DRI DEF PHY Hits
0 Lionel Messi https://cdn.sofifa.org/players/4/20/158023.png Argentina RW,CF,ST 32 94 94 FC Barcelona 158023 5'7" ... Medium Low 5 87 92 92 96 39 66 585
1 C. Ronaldo dos Santos Aveiro https://cdn.sofifa.org/players/4/20/20801.png Portugal ST,LW 34 93 93 Juventus 20801 6'2" ... High Low 5 90 93 82 89 35 78 448
2 Neymar da Silva Santos Jr. https://cdn.sofifa.org/players/4/20/190871.png Brazil LW,CAM 27 92 92 Paris Saint-Germain 190871 5'9" ... High Medium 5 91 85 87 95 32 58 432
3 Jan Oblak https://cdn.sofifa.org/players/4/20/200389.png Slovenia GK 26 91 91 Atlético Madrid 200389 6'2" ... Medium Medium 3 87 92 78 89 52 90 240
4 Kevin De Bruyne https://cdn.sofifa.org/players/4/20/192985.png Belgium CAM,CM 28 91 91 Manchester City 192985 5'11" ... High High 4 76 86 92 86 61 78 298

5 rows × 75 columns

Table 3: (data21) FIFA Player 2020 data

In [52]:
data21.head()
Out[52]:
 ID Name Age Photo Nationality Flag Overall Potential Club ... Penalties Composure Defensive Awareness Standing Tackle Sliding Tackle GK Diving GK Handling GK Kicking GK Positioning GK Reflexes
0 0 253283 Facundo Pellistri 18 https://cdn.sofifa.com/players/253/283/20_60.png Uruguay https://cdn.sofifa.com/flags/uy.png 71 87 Peñarol ... 66.0 61.0 35.0 11.0 18.0 9.0 12.0 7.0 8.0 7.0
1 1 179813 Edinson Cavani 32 https://cdn.sofifa.com/players/179/813/20_60.png Uruguay https://cdn.sofifa.com/flags/uy.png 86 86 Paris Saint-Germain ... 85.0 80.0 57.0 48.0 39.0 12.0 5.0 13.0 13.0 10.0
2 2 245541 Giovanni Reyna 17 https://cdn.sofifa.com/players/245/541/20_60.png United States https://cdn.sofifa.com/flags/us.png 68 87 Borussia Dortmund ... 50.0 59.0 30.0 23.0 24.0 10.0 13.0 14.0 12.0 7.0
3 3 233419 Raphael Dias Belloli 23 https://cdn.sofifa.com/players/233/419/20_60.png Brazil https://cdn.sofifa.com/flags/br.png 81 85 Stade Rennais FC ... 73.0 79.0 45.0 54.0 38.0 8.0 7.0 13.0 8.0 14.0
4 4 198710 James Rodríguez 28 https://cdn.sofifa.com/players/198/710/20_60.png Colombia https://cdn.sofifa.com/flags/co.png 82 82 Everton ... 81.0 87.0 52.0 41.0 44.0 15.0 15.0 15.0 5.0 14.0

5 rows × 92 columns

ETL

Cause we want to anlysis which factors will affect the players' value change. First, we need to clean the data, which make us easy to anlysis the data in the future steps. We will drop the NaN Age, Country, Club, Value, Wage. This step is very important. Because in the orignal Excel sheet, like the unknow age will appear 0. These data will affect our caculation in the later process.

In [53]:
data19['Age'] = data19['Age'].drop(0)
data20['Age'] = data20['Age'].drop(0)
data21['Age'] = data21['Age'].drop(0)
data19['Nationality'].fillna('No', inplace = True)
data20['Country'].fillna('No', inplace = True)
data21['Nationality'].fillna('No', inplace = True)
data19['Club'].fillna('No', inplace = True)
data20['Club'].fillna('No', inplace = True)
data21['Club'].fillna('No', inplace = True)
data19["Value"] = data19.Value.replace(0,np.nan)
data20["Value"] = data20.Value.replace(0,np.nan)
data21["Value"] = data21.Value.replace(0,np.nan)
data19["Wage"] = data19.Wage.replace(0,np.nan)
data20["Wage"] = data20.Wage.replace(0,np.nan)
data21["Wage"] = data21.Wage.replace(0,np.nan)

In the next step, we want to merge all useful data into our create own table. We want a table will have players' Name, every year Age, Club, Nationality, Value, Wage. Because we think these factor may affect the value change in the future. This step can help us more easier to anlysis the data in the future steps.

First, We create a new table call "information19", this table mainly display the players' information in the FIFA 2019.

In [54]:
information19 = pd.DataFrame()
information19["Name"] = data19["Name"]
information19["19Age"]= data19["Age"]
information19["19Club"] = data19["Club"]
information19["19Nationality"] = data19["Nationality"]
information19["19Value"] = data19["Value"]
information19["19Wage"] = data19["Wage"]

information19.head()
Out[54]:
Name 19Age 19Club 19Nationality 19Value 19Wage
0 L. Messi NaN FC Barcelona Argentina 110500000.0 565000.0
1 Cristiano Ronaldo 33.0 Juventus Portugal 77000000.0 405000.0
2 Neymar Jr 26.0 Paris Saint-Germain Brazil 118500000.0 290000.0
3 De Gea 27.0 Manchester United Spain 72000000.0 260000.0
4 K. De Bruyne 27.0 Manchester City Belgium 102000000.0 355000.0

Second, We create a new table call "information20", this table mainly display the players' information in the FIFA 2020.

In [55]:
information20 = pd.DataFrame()
information20["Name"] = data20["Name"]
information20["20Age"]= data20["Age"]
information20["20Club"] = data20["Club"]
information20["20Nationality"] = data20["Country"]
information20["20Value"] = data20["Value"]
information20["20Wage"] = data20["Wage"]

information20.head()
Out[55]:
Name 20Age 20Club 20Nationality 20Value 20Wage
0 Lionel Messi NaN FC Barcelona Argentina 95500000.0 565000.0
1 C. Ronaldo dos Santos Aveiro 34.0 Juventus Portugal 58500000.0 405000.0
2 Neymar da Silva Santos Jr. 27.0 Paris Saint-Germain Brazil 105500000.0 290000.0
3 Jan Oblak 26.0 Atlético Madrid Slovenia 77500000.0 125000.0
4 Kevin De Bruyne 28.0 Manchester City Belgium 90000000.0 370000.0

Third, We create a new table call "information21", this table mainly display the players' information in the FIFA 2021.

In [56]:
information21 = pd.DataFrame()
information21["Name"] = data21["Name"]
information21["21Age"] = data21["Age"]
information21["21Club"] = data21["Club"]
information21["21Nationality"] = data21["Nationality"]
information21["21Value"] = data21["Value"]
information21["21Wage"] = data21["Wage"]

information21.head()
Out[56]:
Name 21Age 21Club 21Nationality 21Value 21Wage
0 Facundo Pellistri NaN Peñarol Uruguay 4900000.0 500.0
1 Edinson Cavani 32.0 Paris Saint-Germain Uruguay 35500000.0 150000.0
2 Giovanni Reyna 17.0 Borussia Dortmund United States 1800000.0 2000.0
3 Raphael Dias Belloli 23.0 Stade Rennais FC Brazil 23000000.0 50000.0
4 James Rodríguez 28.0 Everton Colombia 22500000.0 105000.0

We will use these three tables to do the data compare. We want to see whether or not the covid affect the players' value change during this three year. Next step we will do some simple data anlysis about comparing players' three year factor change.

This step we want to compare the players’ three year age change. When do this part data analysis, we discover a problem show in the original tabular data. Some players’ age change will not increase with year change. This may be due to statistical errors in the original tabular data collection. But we mainly want to see the age distribution of the players. What is the age range of most players, and how old are the oldest and youngest players?

In [57]:
fig, axes = plt.subplots(1,3,figsize = (12,4),sharey = True)
fig.tight_layout(h_pad =4)
sns.set(color_codes=True)
information19["19Age"].plot.hist(ax= axes[0]).set_title("19Age")
information20["20Age"].plot.hist(ax= axes[1]).set_title("20Age")
information21["21Age"].plot.hist(ax= axes[2]).set_title("21Age")
Out[57]:
Text(0.5, 1.0, '21Age')

In these three tables, we can see how often the players are distributed over three years of age. We can find that basically the age with the largest number of players every year is around 25 years old. The oldest player will not exceed 45 years old, and the youngest player is basically 16 years old. This shows that although it changes every year, there will always be new young players joining. Football has relatively high age requirements for players. Will the value of young players be lower than that of experienced players? Is the player's age directly proportional to the player's value? We will conduct a more in-depth analysis later.

Next we also want to see whether the salaries of players have changed a lot under the influence of covid. We first normalized the salaries to ensure that the final chart display allows people to see the distribution of players' salaries at a glance.

In [58]:
fig, axes = plt.subplots(1,3,figsize = (12,4),sharey = True)
fig.tight_layout(h_pad =4)
sns.set(color_codes=True)
log19Wage = np.log(information19["19Wage"])
log20Wage = np.log(information20["20Wage"])
log21Wage = np.log(information21["21Wage"])
log19Wage.plot.hist(ax= axes[0]).set_title("19Wage")
log20Wage.plot.hist(ax= axes[1]).set_title("20Wage")
log21Wage.plot.hist(ax= axes[2]).set_title("21Wage")
Out[58]:
Text(0.5, 1.0, '21Wage')

According to these three tables, we can analyze the salary distribution changes in the past three years. We can find that in 2019 and 20, most of the salary distribution is gathered on the far left. Perhaps because of the epidemic, the salaries of players have not changed much, and more players' salaries have been reduced. By 21 years, we can find that the salaries of many originally low-paid players have increased, but the maximum salary of players has decreased.

Next, we want to analyze the changes in the value of players in the past three years. Whether the player's value has been reduced or increased by the impact of covid. At the same time, we also want to analyze the value distribution of players, trying to find out whether changes in player values are related to changes in other factors.

In [59]:
fig, axes = plt.subplots(1,3,figsize = (12,4),sharey = True)
fig.tight_layout(h_pad =4)
sns.set(color_codes=True)
log19Value = np.log(information19["19Value"])
log20Value = np.log(information20["20Value"])
log21Value = np.log(information21["21Value"])
log19Value.plot.hist(ax = axes[0]).set_title("19Value")
log20Value.plot.hist(ax = axes[1]).set_title("20Value")
log21Value.plot.hist(ax = axes[2]).set_title("21Value")
Out[59]:
Text(0.5, 1.0, '21Value')

We first normalize the player values. Let's make our chart display easier to see the exact distribution. We find that the value of players generally increases with age for the most part. The 20-year increase in player value is not obvious. Perhaps because of covid, most players did not have the opportunity to endorse or play, so the value of the players has not changed. In 21 years, most low-value players have grown.

We also speculate whether players from some countries will be worth more. Country is likely to be an important factor affecting player value changes.

In [60]:
information19.groupby(["19Nationality"])["19Value"].mean().plot.bar(figsize=(60,25))
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f881b242d30>
In [61]:
information20.groupby(["20Nationality"])["20Value"].mean().plot.bar(figsize=(60,25))
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f881b0a9a00>
In [62]:
information21.groupby(["21Nationality"])["21Value"].mean().plot.bar(figsize=(60,25))
Out[62]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8816320160>

When we analyze the relationship between player country and player value, we try to first calculate the average value of all players in each country. At this time, we found that some countries have a small number of players, but the value of each player is very high. This will cause the average value of all players in the country to be high, and it is impossible to accurately analyze which countries' players have higher value. Because the number of players in each country is not guaranteed to be consistent. In addition, we also found that some countries have missing player value data. In the following analysis, we will solve this problem to analyze the relationship between player country and value.

EDA

Next, we will try to find out the correlation between the player's value and his age and wage. We firstly think that there are some close connections between the player's value and his age and wage. We will use several graphs to prove and illustrate our ideas.

In [63]:
features = ["Value","Age","Wage"]
corr = data19[features].corr()
corr
sns.heatmap(corr)
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8815fbf7f0>

From the correlation graph, we can easily find out the results that the player's value has high correlation with the player's wage, but it has almost no connection to the player's age.

In [81]:
# Save the change csv to the new csv which not include the string in the Value, Age, Wage
path1 = "/content/drive/MyDrive/CMPS3160_Project/data/19.csv"
data19 = pd.read_csv(path1, sep=",",encoding = "ISO-8859-1")

path2 = "/content/drive/MyDrive/CMPS3160_Project/data/20.csv"
data20 = pd.read_csv(path2, sep=",",encoding = "ISO-8859-1")

path3 = "/content/drive/MyDrive/CMPS3160_Project/data/21.csv"
data21 = pd.read_csv(path3, sep=",",encoding = "ISO-8859-1")
In [65]:
import plotly.express as px
nat_cnt=data19.groupby('Nationality').apply(lambda x:x['Name'].count()).reset_index(name='Counts')
nat_cnt.sort_values(by='Counts',ascending=False,inplace=True)
top_20_nat_cnt=nat_cnt[:20]
fig=px.bar(top_20_nat_cnt,x='Nationality',y='Counts',color='Counts',title='Nationwise Representation in the FIFA Game')
fig.show()
In [66]:
import plotly.express as px
cost_prop=data21[['Name','Club','Nationality','Wage','Value','Position']]
fig=px.scatter(cost_prop,x='Value',y='Wage',color='Value',size='Wage',hover_data=['Name','Club','Nationality','Position'],title='Value vs Wage Presentation of all the Players')
fig.show()

From the scatter graph above, we can see that the player who has higher wage will have the higher value in the FIFA, also the player who has lower wage will have the lower value in the FIFA. (The graph shows an linear interaction with stimated positive slope.)

After finding out the relationship between value and wage, we want to explore the relationships between the player's value and his position. We want to find out which position of the player has higher value than others? This provides us with a basis going forward. Let’s see how every position takes a bite out of the pie.

In [67]:
y = data19.groupby("Position")[["Value"]].sum()
y = y.reset_index()
y.groupby("Position")[["Value"]].sum()
y.sort_values("Value",ascending=False,inplace=True)

mylabels = y["Position"]
ys = y["Value"]

percent = 100.*ys/ys.sum()
patches, texts = plt.pie(ys, startangle=90, radius=1.2,shadow = True)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(mylabels,percent)]

sort_legend = True
if sort_legend:
    patches, labels, dummy =  zip(*sorted(zip(patches, labels, ys),
                                          key=lambda x: x[2],
                                          reverse=True))

plt.legend(patches, labels, loc='upper right', bbox_to_anchor=(-0.1, 1.),
           fontsize=8)
plt.title("FIFA Average value Percentage Per Position in 2019")

plt.show()
In [68]:
y = data20.groupby("BP")[["Value"]].sum()
y = y.reset_index()
y.groupby("BP")[["Value"]].sum()
y.sort_values("Value",ascending=False,inplace=True)

mylabels = y["BP"]
ys = y["Value"]

percent = 100.*ys/ys.sum()
patches, texts = plt.pie(ys, startangle=90, radius=1.2,shadow = True)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(mylabels,percent)]

sort_legend = True
if sort_legend:
    patches, labels, dummy =  zip(*sorted(zip(patches, labels, ys),
                                          key=lambda x: x[2],
                                          reverse=True))

plt.legend(patches, labels, loc='upper right', bbox_to_anchor=(-0.1, 1.),
           fontsize=8)
plt.title("FIFA Average value Percentage Per Position in 2020")

plt.show()
In [69]:
y = data21.groupby("Position")[["Value"]].sum()
y = y.reset_index()
y.groupby("Position")[["Value"]].sum()
y.sort_values("Value",ascending=False,inplace=True)

mylabels = y["Position"]
ys = y["Value"]

percent = 100*ys/ys.sum()
patches, texts = plt.pie(ys, startangle=90, radius=1.2,shadow = True)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(mylabels,percent)]

sort_legend = True
if sort_legend:
    patches, labels, dummy =  zip(*sorted(zip(patches, labels, ys),
                                          key=lambda x: x[2],
                                          reverse=True))

plt.legend(patches, labels, loc='upper right', bbox_to_anchor=(-0.1, 1.),
           fontsize=8)
plt.title("FIFA Average value Percentage Per Position in 2021")

plt.show()

Based on three pie graphs above, we can find that striker is the most valuable position compared with other positions (for three years). That's the reason that most players of FIFA are willing to pay a lot to get an outstanding striker. Besides, Center-Back, Gool-keeper, and Center Attacking Midfielder are the second popular positions that have higher average player's value than other positions. Which means those positions players card will have higher value to buy or sell in the future.

In [70]:
def mon_to_num(num):
    if type(num) == int:
        return num
    a = list(num)
    if a[0] == "€":
        a.remove("€")
    if a[-1] == "K":
        a.remove("K")
    if a[-1] == "M":
        a.remove("M")
        a = float("".join(a))
        a = int(a)
        a = str(a)+"000"
        a = list(a)
    b = int("".join(a))
    return b


del_list = list()
for i in range(0,len(data190["Wage"])):
    if type(data190["Wage"][i]) == str and list(data190["Wage"][i])[0] == "€" and list(data190["Wage"][i])[-1] == "K":
        data190["Wage"][i] = mon_to_num(data190["Wage"][i])
    elif type(data190["Wage"][i]) == str and list(data190["Wage"][i])[0] == "€" and list(data190["Wage"][i])[-1] == "M":
        data190["Wage"][i] = mon_to_num(data190["Wage"][i])
    else :
        del_list.append(i)
  
data190 = data190.drop(del_list)
data190.reset_index(inplace=True,drop=True)
data190 = data190.loc[data190["Age"].notnull()]
data190.reset_index(inplace=True,drop=True)
data19_final = data190.loc[:,["Age","Wage","Value"]]

del_list = list()
for i in range(0,len(data19_final["Value"])):
    if type(data19_final["Value"][i]) == str and list(data19_final["Value"][i])[0] == "€" and list(data19_final["Value"][i])[-1] == "K":
        data19_final["Value"][i] = mon_to_num(data19_final["Value"][i])
    elif type(data19_final["Value"][i]) == str and list(data19_final["Value"][i])[0] == "€" and list(data19_final["Value"][i])[-1] == "M":
        data19_final["Value"][i] = mon_to_num(data19_final["Value"][i])
    else :
        del_list.append(i)

data19_final = data19_final.drop(del_list)
data19_final.reset_index(inplace=True,drop=True)
data19_final
<ipython-input-70-d9fdcb1b3217>:22: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Out[70]:
Age Wage Value
0 31 565 110000
1 33 405 77000
2 26 290 118000
3 27 260 72000
4 27 355 102000
... ... ... ...
17619 19 1 60
17620 19 1 60
17621 16 1 60
17622 17 1 60
17623 16 1 60

17624 rows × 3 columns

In [71]:
def mon_to_num(num):
    if type(num) == int:
        return num
    a = list(num)
    if a[0] == "€":
        a.remove("€")
    if a[-1] == "K":
        a.remove("K")
    if a[-1] == "M":
        a.remove("M")
        a = float("".join(a))
        a = int(a)
        a = str(a)+"000"
        a = list(a)
    b = int("".join(a))
    return b

del_list = list()


for i in range(0,len(data_2["Wage"])):
    if type(data_2["Wage"][i]) == str and list(data_2["Wage"][i])[0] == "€" and list(data_2["Wage"][i])[-1] == "K":
        data_2["Wage"][i] = mon_to_num(data_2["Wage"][i])
    elif type(data_2["Wage"][i]) == str and list(data_2["Wage"][i])[0] == "€" and list(data_2["Wage"][i])[-1] == "M":
        data_2["Wage"][i] = mon_to_num(data_2["Wage"][i])
    else :
        del_list.append(i)
        
data_2 = data_2.drop(del_list)

data_2.reset_index(inplace=True,drop=True)

data_2 = data_2.loc[data_2["Age"].notnull()]

data_2_final = data_2.loc[:,["Age","Wage","Value"]]

del_list = list()
for i in range(0,len(data_2_final["Value"])):
    if type(data_2_final["Value"][i]) == str and list(data_2_final["Value"][i])[0] == "€" and list(data_2_final["Value"][i])[-1] == "K":
        data_2_final["Value"][i] = mon_to_num(data_2_final["Value"][i])
    elif type(data_2_final["Value"][i]) == str and list(data_2_final["Value"][i])[0] == "€" and list(data_2_final["Value"][i])[-1] == "M":
        data_2_final["Value"][i] = mon_to_num(data_2_final["Value"][i])    
    else :
        del_list.append(i)

data_2_final = data_2_final.drop(del_list)

data_2_final.reset_index(inplace=True,drop=True)

data20_fianl = data_2_final
<ipython-input-71-4835b3585793>:23: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

<ipython-input-71-4835b3585793>:42: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

<ipython-input-71-4835b3585793>:40: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

In [72]:
data21.loc[data21["Wage"].isnull()]  
data21.loc[data21["Value"].isnull()] 
data21.loc[data21["Age"].isnull()]  

data21_final = data21.loc[:,["Age","Wage","Value"]]
In [73]:
dataf = pd.concat((data19_final,data20_final,data21_final),axis=0)
In [74]:
label_need=dataf.keys()
print(label_need)
Index(['Age', 'Wage', 'Value'], dtype='object')
In [75]:
#define x and y
x = dataf[label_need].values[:,0:2]
y = dataf[label_need].values[:,2]
print(x)
print(y)
[['31' 565]
 ['33' 405]
 ['26' 290]
 ...
 [35 0]
 [32 20000]
 [35 6000]]
[110000 77000 118000 ... 0 2200000 110000]
In [76]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=33, test_size=0.25)
#  print(x_test.shape) 
# Train the model
from sklearn.neighbors import KNeighborsRegressor
k = 5
knn = KNeighborsRegressor(k)
knn.fit(x_train, y_train)
Out[76]:
KNeighborsRegressor()
In [77]:
y_pred = knn.predict(x_test) 
plt.figure(figsize=(16, 10), dpi=144)
plt.scatter(x_test[:,0], y_test, c='g', s=100)         
plt.scatter(x_test[:,0], y_pred, c='k')       
plt.axis('tight')
plt.title("KNeighborsRegressor (k = %i)" % k)
plt.xlabel("Age")
plt.ylabel("Value")
plt.show()
In [78]:
plt.figure(figsize=(16, 10), dpi=144)
plt.scatter(x_test[:,1], y_test, c='g', s=100)        
plt.scatter(x_test[:,1], y_pred, c='k')      
plt.axis('tight')
plt.title("KNeighborsRegressor (k = %i)" % k)
plt.xlabel("Wage")
plt.ylabel("Value")
plt.show()

From the Knn regression graph above, We can predict the player's value based on his age and wage.

We were relatively successful in predicting FIFA player's value with a limited number of dataset.

However, a large portion of one player's value depends on his wage and position. Our goal was not to evaluate 'value' stats, but to try to separate positions and other factors, and use only those deemed significant. We believe we accomplished this and have more ideas for improvement.

Nevertheless, while this project is far from perfect, we are satisfied with what we've created, and can proudly display a working analysis with many interesting aspects.

This is our final product for our FIFA Player Value Analysis, and thank you for reading.

Cover Format

In [79]:
%%shell
jupyter nbconvert --to html /content/drive/MyDrive/CMPS3160_Project/Project.ipynb
# Change our notebook format to the html format
[NbConvertApp] Converting notebook /content/drive/MyDrive/CMPS3160_Project/Project.ipynb to html
[NbConvertApp] Writing 2767513 bytes to /content/drive/MyDrive/CMPS3160_Project/Project.html
Out[79]:

Final Plan

We changed some of our previous code in the current project and added missing explanations. We currently spend a lot of time looking for the right data, and we find it very difficult to find very suitable data. In addition, we will continue to use the knowledge learned in class to continue research.