Group Members: Haochen Chen, Maria Chen
Dataset:
Kaggle Competition Datasets
Introduction of Project:
Football is drawing more and more attention. FIFA (Fédération Internationale de Football Association) is the international governing body of association football, an organization meant to ensure fair competition for players from all countries. The football video game of the same name bases its player data on real-world data. Each player in the game has a different value, wage, and skill rating. Users can buy player cards to build their own team and win matches. This project analyzes the factors that change a player's value, to help users get the most valuable player cards at the lowest cost. We use three years of data (FIFA 19 to FIFA 21) because we want to see whether COVID-19 affected players' value and what changed across these three years.
Simple Timeline for Project:
Our team will meet every Friday to discuss each other's progress, the problems encountered, and possible solutions. Maria Chen will be mainly responsible for the player analysis, and Haochen Chen for the team analysis. After both of us have completed the analysis of our respective data, we will combine the results and work together to predict the future performance of players and teams.
Connect and Import: connect the notebook to Google Drive and import the basic packages.
# Connect our notebook to the Drive
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/CMPS3160_Project
# Import the packages used in the rest of the notebook
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set_theme()
from scipy import stats, integrate
from collections import Counter
from sklearn.utils import shuffle
# Original FIFA 19-21 tables (1.csv is the FIFA 19 data)
data190 = pd.read_csv('/content/drive/MyDrive/CMPS3160_Project/data/1.csv')
data_2=pd.read_csv('/content/drive/MyDrive/CMPS3160_Project/data/2.csv')
data210 = pd.read_csv('/content/drive/MyDrive/CMPS3160_Project/data/3.csv')
# Load the cleaned CSVs, saved without the string formatting in Value, Age, and Wage
path1 = "/content/drive/MyDrive/CMPS3160_Project/data/19.csv"
data19 = pd.read_csv(path1, sep=",",encoding = "ISO-8859-1")
path2 = "/content/drive/MyDrive/CMPS3160_Project/data/20.csv"
data20 = pd.read_csv(path2, sep=",",encoding = "ISO-8859-1")
path3 = "/content/drive/MyDrive/CMPS3160_Project/data/21.csv"
data21 = pd.read_csv(path3, sep=",",encoding = "ISO-8859-1")
We collected FIFA player data from the past three years to analyze which factors affect a player's value. The main fields we need are each player's name, age, value, wage, country, club, position, and overall score. The overall score comes from the FIFA game's scoring system. Six main attributes appear alongside the overall score: speed, shooting, passing, defending, dribbling, and physicality.
Key Factor Definitions:
Value: The value of a player card in FIFA is an estimate of the amount for which the owner can sell the card to another user, or the amount a user is willing to pay for it.
Wage: The wage of a player is the amount of money regularly paid to them by the club they play for.
Table 1: (data19) FIFA Players 2019 data
data19.head()
Table 2: (data20) FIFA Players 2020 data
data20.head()
Table 3: (data21) FIFA Players 2021 data
data21.head()
Because we want to analyze which factors affect changes in players' value, we first need to clean the data, which makes the later analysis easier. We mark unknown Age, Value, and Wage entries as NaN and fill missing Country and Club entries with 'No'. This step is very important because in the original Excel sheet an unknown age, for example, appears as 0, and such entries would distort our calculations later.
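# Treat an age of 0 as unknown; drop(0) would only remove the row labelled 0
data19['Age'] = data19['Age'].replace(0, np.nan)
data20['Age'] = data20['Age'].replace(0, np.nan)
data21['Age'] = data21['Age'].replace(0, np.nan)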
# Assign the result back; inplace fillna on a selected column may not update the frame
data19['Nationality'] = data19['Nationality'].fillna('No')
data20['Country'] = data20['Country'].fillna('No')
data21['Nationality'] = data21['Nationality'].fillna('No')
data19['Club'] = data19['Club'].fillna('No')
data20['Club'] = data20['Club'].fillna('No')
data21['Club'] = data21['Club'].fillna('No')
data19["Value"] = data19.Value.replace(0,np.nan)
data20["Value"] = data20.Value.replace(0,np.nan)
data21["Value"] = data21.Value.replace(0,np.nan)
data19["Wage"] = data19.Wage.replace(0,np.nan)
data20["Wage"] = data20.Wage.replace(0,np.nan)
data21["Wage"] = data21.Wage.replace(0,np.nan)
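To make the cleaning step concrete, here is a minimal sketch on a small made-up table (the names and numbers are not from the FIFA data). It shows how 0 placeholders can be turned into NaN so that pandas skips them in later statistics, and why `Series.drop(0)` is not the right tool: it removes the row labelled 0, not the rows whose value is 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 0 stands for "unknown", as in the original sheets
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [24, 0, 31, 19],
    "Value": [1_000_000, 250_000, 0, 600_000],
})

# drop(0) removes the row with index label 0 (player "A"), not the zero age
dropped = toy["Age"].drop(0)

# replace(0, np.nan) marks every zero as missing instead
cleaned = toy.copy()
cleaned[["Age", "Value"]] = cleaned[["Age", "Value"]].replace(0, np.nan)

# NaN entries are ignored by mean(), so the placeholder no longer drags it down
mean_age_raw = toy["Age"].mean()
mean_age_clean = cleaned["Age"].mean()
print(mean_age_raw, mean_age_clean)
```

The same `replace(0, np.nan)` call works on a single column, which is the form used for Value and Wage in the cleaning cell.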
In the next step, we collect the useful columns into our own tables. We want each table to hold the players' Name and, for every year, their Age, Club, Nationality, Value, and Wage, because we think these factors may affect future value changes. This makes the data easier to analyze in later steps.
First, we create a new table called "information19", which holds the players' information from FIFA 2019.
information19 = pd.DataFrame()
information19["Name"] = data19["Name"]
information19["19Age"]= data19["Age"]
information19["19Club"] = data19["Club"]
information19["19Nationality"] = data19["Nationality"]
information19["19Value"] = data19["Value"]
information19["19Wage"] = data19["Wage"]
information19.head()
Second, we create a new table called "information20", which holds the players' information from FIFA 2020.
information20 = pd.DataFrame()
information20["Name"] = data20["Name"]
information20["20Age"]= data20["Age"]
information20["20Club"] = data20["Club"]
information20["20Nationality"] = data20["Country"]
information20["20Value"] = data20["Value"]
information20["20Wage"] = data20["Wage"]
information20.head()
Third, we create a new table called "information21", which holds the players' information from FIFA 2021.
information21 = pd.DataFrame()
information21["Name"] = data21["Name"]
information21["21Age"] = data21["Age"]
information21["21Club"] = data21["Club"]
information21["21Nationality"] = data21["Nationality"]
information21["21Value"] = data21["Value"]
information21["21Wage"] = data21["Wage"]
information21.head()
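One way to line the three yearly tables up for comparison is to merge them on `Name`. The sketch below uses tiny made-up tables with the same column naming as `information19`, `information20`, and `information21`. The helper `merge_years` is hypothetical, and it assumes that a name identifies a player, which may not hold in the real data (different players can share a name), so it keeps only the first row per name.

```python
import pandas as pd

# Hypothetical stand-ins for information19 / information20 / information21
info19 = pd.DataFrame({"Name": ["A", "B", "C"], "19Age": [24, 30, 19], "19Value": [1.0e6, 5.0e6, 2.0e5]})
info20 = pd.DataFrame({"Name": ["A", "B", "D"], "20Age": [25, 31, 22], "20Value": [1.5e6, 4.0e6, 8.0e5]})
info21 = pd.DataFrame({"Name": ["A", "C", "D"], "21Age": [26, 21, 23], "21Value": [2.0e6, 4.0e5, 9.0e5]})

def merge_years(*tables):
    """Inner-join yearly tables on Name, keeping the first row per name."""
    merged = tables[0].drop_duplicates("Name")
    for table in tables[1:]:
        merged = merged.merge(table.drop_duplicates("Name"), on="Name", how="inner")
    return merged

combined = merge_years(info19, info20, info21)
# Value change between the first and last year for players present in all three
combined["ValueChange"] = combined["21Value"] - combined["19Value"]
print(combined)
```

An inner join keeps only players who appear in every year. Using `how="outer"` instead would keep everyone and leave NaN where a year is missing.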
We will use these three tables to compare the data. We want to see whether COVID-19 affected players' value over these three years. As a next step, we run some simple analyses comparing how players' attributes changed across the three years.
In this step we compare players' ages across the three years. While doing this analysis, we discovered a problem in the original tabular data: for some players, age does not increase from one year to the next. This may be due to errors in the original data collection. Our main interest, however, is the age distribution of the players: what is the age range of most players, and how old are the oldest and youngest players?
fig, axes = plt.subplots(1,3,figsize = (12,4),sharey = True)
fig.tight_layout(h_pad =4)
sns.set(color_codes=True)
information19["19Age"].plot.hist(ax= axes[0]).set_title("19Age")
information20["20Age"].plot.hist(ax= axes[1]).set_title("20Age")
information21["21Age"].plot.hist(ax= axes[2]).set_title("21Age")
These three histograms show how players are distributed by age in each of the three years. In every year, the most common age is around 25. The oldest player is no older than 45, and the youngest is about 16. This shows that, although the roster changes every year, new young players keep joining. Football places relatively high demands on players' age. Is the value of young players lower than that of experienced players? Is a player's age directly proportional to the player's value? We will analyze this in more depth later.
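The age inconsistency noted before the histograms can be checked directly once the ages for all years sit in one table. The sketch below uses made-up ages and assumes a player should age by 0 or 1 between editions (0 is possible when the data snapshot falls before the birthday). Anything else is flagged for a closer look.

```python
import pandas as pd

# Hypothetical merged table with one age column per year
ages = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "19Age": [24, 30, 19],
    "20Age": [25, 30, 20],
    "21Age": [26, 29, 21],
})

# Year-over-year differences; the first column of diff() is NaN and is dropped
diffs = ages[["19Age", "20Age", "21Age"]].diff(axis=1).iloc[:, 1:]

# Flag players whose age goes down or jumps by more than one year
suspicious = ages[((diffs < 0) | (diffs > 1)).any(axis=1)]
print(suspicious["Name"].tolist())

# Youngest, oldest, and median age for one year
summary = ages["19Age"].agg(["min", "max", "median"])
print(summary)
```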
Next we want to see whether players' wages changed much under the influence of COVID-19. We first log-transform the wages so that the final charts show the distribution of players' wages at a glance.
fig, axes = plt.subplots(1,3,figsize = (12,4),sharey = True)
fig.tight_layout(h_pad =4)
sns.set(color_codes=True)
log19Wage = np.log(information19["19Wage"])
log20Wage = np.log(information20["20Wage"])
log21Wage = np.log(information21["21Wage"])
log19Wage.plot.hist(ax= axes[0]).set_title("19Wage")
log20Wage.plot.hist(ax= axes[1]).set_title("20Wage")
log21Wage.plot.hist(ax= axes[2]).set_title("21Wage")
From these three histograms, we can analyze how the wage distribution changed over the three years. In 2019 and 2020, most of the wage distribution is concentrated on the far left. Perhaps because of the pandemic, players' wages did not change much, and more players' wages were reduced. By 2021, the wages of many originally low-paid players had increased, but the maximum wage had decreased.
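The wage cell applies `np.log`, which is a log transform: it compresses the long right tail so the bulk of the distribution becomes visible. The sketch below illustrates the effect on made-up wages by comparing skewness before and after the transform. It also shows that missing wages stay missing, which matters because `np.log` is only defined for positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly wages in euros: many low values and one very high one
wages = pd.Series([1_000, 2_000, 2_000, 3_000, 5_000, 8_000, 20_000, 300_000], dtype=float)

log_wages = np.log(wages)  # natural log; only defined for positive values

# Skewness near 0 means a roughly symmetric distribution
skew_raw = wages.skew()
skew_log = log_wages.skew()
print(round(skew_raw, 2), round(skew_log, 2))

# Missing wages stay missing after the transform
with_gap = pd.Series([1_000.0, np.nan, 4_000.0])
log_with_gap = np.log(with_gap)
print(log_with_gap.isna().sum())
```

A wage of exactly 0 would become negative infinity under `np.log`, which is one reason to replace 0 with NaN before plotting.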
Next, we analyze how players' value changed over the three years, and whether COVID-19 reduced or increased it. At the same time, we look at the distribution of player values to find out whether changes in value are related to changes in other factors.
fig, axes = plt.subplots(1,3,figsize = (12,4),sharey = True)
fig.tight_layout(h_pad =4)
sns.set(color_codes=True)
log19Value = np.log(information19["19Value"])
log20Value = np.log(information20["20Value"])
log21Value = np.log(information21["21Value"])
log19Value.plot.hist(ax = axes[0]).set_title("19Value")
log20Value.plot.hist(ax = axes[1]).set_title("20Value")
log21Value.plot.hist(ax = axes[2]).set_title("21Value")
We first log-transform the player values so that the charts show the distribution more clearly. We find that, for the most part, the value of players generally increases with age. In 2020 the increase in player value is not obvious. Perhaps because of COVID-19, most players did not have the opportunity to play or to sign endorsements, so their value did not change. In 2021, the value of most low-value players has grown.
We also wonder whether players from some countries are worth more. Country is likely to be an important factor affecting changes in player value.
information19.groupby(["19Nationality"])["19Value"].mean().plot.bar(figsize=(60,25))
information20.groupby(["20Nationality"])["20Value"].mean().plot.bar(figsize=(60,25))
information21.groupby(["21Nationality"])["21Value"].mean().plot.bar(figsize=(60,25))
When analyzing the relationship between a player's country and value, we first calculate the average value of all players from each country. We found that some countries have only a few players, each of very high value. This makes the country's average high and makes it impossible to tell accurately which countries' players are more valuable, because the number of players per country is not consistent. In addition, some countries have missing player value data. In the following analysis, we will address this problem in order to analyze the relationship between country and value.
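One possible way to handle the small-country problem is to report the player count next to the average and to keep only countries with enough valued players. The sketch below uses made-up data. The threshold `MIN_PLAYERS` is an assumption for illustration and would need tuning on the real tables.

```python
import pandas as pd

# Hypothetical players: country "X" has a single, very valuable player
players = pd.DataFrame({
    "Nationality": ["X", "Y", "Y", "Y", "Z", "Z", "Z"],
    "Value": [90e6, 10e6, 20e6, 30e6, 5e6, 6e6, None],
})

MIN_PLAYERS = 3  # assumed threshold; tune it to the real data

by_country = players.groupby("Nationality")["Value"].agg(["mean", "median", "count"])

# count ignores missing values, so countries with too few valued players drop out
reliable = by_country[by_country["count"] >= MIN_PLAYERS].sort_values("mean", ascending=False)
print(reliable)
```

Reporting the median alongside the mean also helps, since a single star player shifts the median much less than the mean.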
Next, we look for correlations between a player's value and the player's age and wage. Our initial hypothesis is that value is closely connected to both. We use several graphs to test and illustrate this idea.
features = ["Value","Age","Wage"]
corr = data19[features].corr()
corr
sns.heatmap(corr)
From the correlation heatmap, we can see that a player's value is highly correlated with the player's wage but has almost no linear correlation with the player's age.
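A correlation coefficient close to zero rules out a linear relationship, not necessarily every relationship. The sketch below uses made-up values that rise until age 27 and then fall. This shape is an assumption for illustration, not a result from our data. The overall Pearson correlation is about zero even though value clearly depends on age within each half.

```python
import numpy as np
import pandas as pd

# Hypothetical data: value peaks at age 27 and falls off on both sides
age = np.arange(17, 38)
value = 50e6 - 0.4e6 * (age - 27) ** 2
toy = pd.DataFrame({"Age": age, "Value": value})

# Pearson correlation over the full age range
pearson_all = toy["Age"].corr(toy["Value"])

# Pearson correlation within the rising and the falling half
pearson_young = toy[toy["Age"] <= 27].corr().loc["Age", "Value"]
pearson_old = toy[toy["Age"] >= 27].corr().loc["Age", "Value"]
print(pearson_all, pearson_young, pearson_old)
```

A scatter plot of Value against Age, or correlations computed per age band as above, would show whether the real data has a similar shape.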
# Reload the cleaned CSVs, saved without the string formatting in Value, Age, and Wage
path1 = "/content/drive/MyDrive/CMPS3160_Project/data/19.csv"
data19 = pd.read_csv(path1, sep=",",encoding = "ISO-8859-1")
path2 = "/content/drive/MyDrive/CMPS3160_Project/data/20.csv"
data20 = pd.read_csv(path2, sep=",",encoding = "ISO-8859-1")
path3 = "/content/drive/MyDrive/CMPS3160_Project/data/21.csv"
data21 = pd.read_csv(path3, sep=",",encoding = "ISO-8859-1")
import plotly.express as px
nat_cnt = data19.groupby('Nationality')['Name'].count().reset_index(name='Counts')
nat_cnt.sort_values(by='Counts',ascending=False,inplace=True)
top_20_nat_cnt=nat_cnt[:20]
fig=px.bar(top_20_nat_cnt,x='Nationality',y='Counts',color='Counts',title='Nationwise Representation in the FIFA Game')
fig.show()
import plotly.express as px
cost_prop=data21[['Name','Club','Nationality','Wage','Value','Position']]
fig=px.scatter(cost_prop,x='Value',y='Wage',color='Value',size='Wage',hover_data=['Name','Club','Nationality','Position'],title='Value vs Wage Presentation of all the Players')
fig.show()
From the scatter plot above, we can see that players with higher wages tend to have higher values in FIFA, and players with lower wages tend to have lower values. (The graph shows a roughly linear relationship with an estimated positive slope.)
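The slope of such a relationship can be estimated with a least-squares line. The sketch below fits one with `np.polyfit` on made-up (value, wage) pairs. The slope of 0.004 and the noise level are invented for illustration and say nothing about the real FIFA numbers.

```python
import numpy as np

# Hypothetical (value, wage) pairs in euros, roughly proportional with noise
rng = np.random.default_rng(0)
value = rng.uniform(1e5, 5e7, size=200)
wage = 0.004 * value + rng.normal(0, 5_000, size=200)

# Least-squares line: wage is approximately slope * value + intercept
slope, intercept = np.polyfit(value, wage, deg=1)

# Pearson correlation as a measure of how well a line describes the data
r = np.corrcoef(value, wage)[0, 1]
print(f"slope={slope:.4f}, r={r:.3f}")
```

On the real table, the same two calls could be applied to `data21['Value']` and `data21['Wage']` after dropping missing rows.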
After finding the relationship between value and wage, we want to explore the relationship between a player's value and position. Which positions have a higher value than others? This provides us with a basis going forward. Let's see how much of the pie each position takes.
y = data19.groupby("Position")[["Value"]].sum()
y = y.reset_index()
y.groupby("Position")[["Value"]].sum()
y.sort_values("Value",ascending=False,inplace=True)
mylabels = y["Position"]
ys = y["Value"]
percent = 100.*ys/ys.sum()
patches, texts = plt.pie(ys, startangle=90, radius=1.2,shadow = True)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(mylabels,percent)]
sort_legend = True
if sort_legend:
    patches, labels, dummy = zip(*sorted(zip(patches, labels, ys),
                                         key=lambda x: x[2],
                                         reverse=True))
plt.legend(patches, labels, loc='upper right', bbox_to_anchor=(-0.1, 1.),
           fontsize=8)
plt.title("FIFA Average value Percentage Per Position in 2019")
plt.show()
y = data20.groupby("BP")[["Value"]].sum()
y = y.reset_index()
y.groupby("BP")[["Value"]].sum()
y.sort_values("Value",ascending=False,inplace=True)
mylabels = y["BP"]
ys = y["Value"]
percent = 100.*ys/ys.sum()
patches, texts = plt.pie(ys, startangle=90, radius=1.2,shadow = True)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(mylabels,percent)]
sort_legend = True
if sort_legend:
    patches, labels, dummy = zip(*sorted(zip(patches, labels, ys),
                                         key=lambda x: x[2],
                                         reverse=True))
plt.legend(patches, labels, loc='upper right', bbox_to_anchor=(-0.1, 1.),
           fontsize=8)
plt.title("FIFA Average value Percentage Per Position in 2020")
plt.show()
y = data21.groupby("Position")[["Value"]].sum()
y = y.reset_index()
y.groupby("Position")[["Value"]].sum()
y.sort_values("Value",ascending=False,inplace=True)
mylabels = y["Position"]
ys = y["Value"]
percent = 100*ys/ys.sum()
patches, texts = plt.pie(ys, startangle=90, radius=1.2,shadow = True)
labels = ['{0} - {1:1.2f} %'.format(i,j) for i,j in zip(mylabels,percent)]
sort_legend = True
if sort_legend:
    patches, labels, dummy = zip(*sorted(zip(patches, labels, ys),
                                         key=lambda x: x[2],
                                         reverse=True))
plt.legend(patches, labels, loc='upper right', bbox_to_anchor=(-0.1, 1.),
           fontsize=8)
plt.title("FIFA Average value Percentage Per Position in 2021")
plt.show()
Based on the three pie charts above, striker is the most valuable position in all three years. This is why most FIFA users are willing to pay a lot to get an outstanding striker. Center-back, goalkeeper, and central attacking midfielder follow with the next-largest shares of value, which means player cards for those positions are likely to have a higher value when bought or sold in the future.
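The pie charts are built from the summed value per position, so a position with many players can take a large slice even if its typical player is not the most valuable. The sketch below, on made-up data, computes the total, the average, and the share of the total side by side so the two rankings can be compared.

```python
import pandas as pd

# Hypothetical players: several strikers of moderate value, one very valuable keeper
players = pd.DataFrame({
    "Position": ["ST", "ST", "ST", "ST", "GK", "CB", "CB"],
    "Value": [10e6, 12e6, 8e6, 10e6, 30e6, 6e6, 4e6],
})

per_position = players.groupby("Position")["Value"].agg(total="sum", average="mean", players="count")
per_position["share_of_total_%"] = 100 * per_position["total"] / per_position["total"].sum()

# The position with the largest total need not have the largest average
top_by_total = per_position["total"].idxmax()
top_by_average = per_position["average"].idxmax()
print(per_position.sort_values("total", ascending=False))
print(top_by_total, top_by_average)
```

If the two rankings differ on the real data, the average is the better guide to what a single card in that position is worth.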
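def mon_to_num(num):
    """Convert a money string such as "€565K" or "€110.5M" to whole euros."""
    if isinstance(num, int):
        return num
    text = num.replace("€", "")
    multiplier = 1
    # K means thousands and M means millions; decimals are kept until the end
    if text.endswith("K"):
        multiplier, text = 1_000, text[:-1]
    elif text.endswith("M"):
        multiplier, text = 1_000_000, text[:-1]
    return int(round(float(text) * multiplier))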
# Keep only rows whose Wage looks like "€...K" or "€...M", then convert to numbers
wage_ok = data190["Wage"].astype(str).str.match(r"€.*[KM]$")
data190 = data190[wage_ok].copy()
data190["Wage"] = data190["Wage"].apply(mon_to_num)
data190 = data190.loc[data190["Age"].notnull()].reset_index(drop=True)
# Repeat the same filter and conversion for Value
data19_final = data190.loc[:, ["Age", "Wage", "Value"]]
value_ok = data19_final["Value"].astype(str).str.match(r"€.*[KM]$")
data19_final = data19_final[value_ok].copy()
data19_final["Value"] = data19_final["Value"].apply(mon_to_num)
data19_final = data19_final.reset_index(drop=True)
data19_final
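The same conversion can also be written with pandas string methods, which avoids looping over rows. The sketch below is one possible version. The helper name `parse_money` is hypothetical, and it assumes the strings always look like "€", a number, and then K or M. Anything else becomes NaN.

```python
import pandas as pd

def parse_money(series):
    """Hypothetical helper: convert strings like "€565K" or "€110.5M" to euros.

    Entries that do not match the pattern become NaN.
    """
    parts = series.astype(str).str.extract(r"^€(?P<number>\d+(?:\.\d+)?)(?P<unit>[KM])$")
    multiplier = parts["unit"].map({"K": 1_000, "M": 1_000_000})
    return pd.to_numeric(parts["number"]) * multiplier

raw = pd.Series(["€565K", "€110.5M", "€0", None, "€2M"])
euros = parse_money(raw)
print(euros.tolist())
```

Rows that come back as NaN can then be removed with `dropna()`, which plays the same role as the list of rows to delete in the loop version.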
# Keep only rows whose Wage looks like "€...K" or "€...M", then convert to numbers
wage_ok = data_2["Wage"].astype(str).str.match(r"€.*[KM]$")
data_2 = data_2[wage_ok].copy()
data_2["Wage"] = data_2["Wage"].apply(mon_to_num)
data_2 = data_2.loc[data_2["Age"].notnull()].reset_index(drop=True)
# Repeat the same filter and conversion for Value
data_2_final = data_2.loc[:, ["Age", "Wage", "Value"]]
value_ok = data_2_final["Value"].astype(str).str.match(r"€.*[KM]$")
data_2_final = data_2_final[value_ok].copy()
data_2_final["Value"] = data_2_final["Value"].apply(mon_to_num)
data_2_final = data_2_final.reset_index(drop=True)
data20_final = data_2_final  # name used when the three years are combined
data21.loc[data21["Wage"].isnull()]
data21.loc[data21["Value"].isnull()]
data21.loc[data21["Age"].isnull()]
data21_final = data21.loc[:,["Age","Wage","Value"]]
dataf = pd.concat((data19_final,data20_final,data21_final),axis=0)
label_need=dataf.keys()
print(label_need)
#define x and y
x = dataf[label_need].values[:,0:2]
y = dataf[label_need].values[:,2]
print(x)
print(y)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=33, test_size=0.25)
# print(x_test.shape)
# Train the model
from sklearn.neighbors import KNeighborsRegressor
k = 5
knn = KNeighborsRegressor(k)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
plt.figure(figsize=(16, 10), dpi=144)
plt.scatter(x_test[:,0], y_test, c='g', s=100)
plt.scatter(x_test[:,0], y_pred, c='k')
plt.axis('tight')
plt.title("KNeighborsRegressor (k = %i)" % k)
plt.xlabel("Age")
plt.ylabel("Value")
plt.show()
plt.figure(figsize=(16, 10), dpi=144)
plt.scatter(x_test[:,1], y_test, c='g', s=100)
plt.scatter(x_test[:,1], y_pred, c='k')
plt.axis('tight')
plt.title("KNeighborsRegressor (k = %i)" % k)
plt.xlabel("Wage")
plt.ylabel("Value")
plt.show()
The KNN regression plots above show that we can predict a player's value from the player's age and wage.
We were relatively successful in predicting FIFA players' value with a limited amount of data.
However, a large part of a player's value depends on wage and position. Our goal was not to evaluate the 'value' statistic itself, but to separate out position and other factors and use only those deemed significant. We believe we accomplished this, and we have more ideas for improvement.
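KNN relies on distances, so a feature measured in thousands of euros (Wage) can outweigh one measured in years (Age). The sketch below shows one common way to handle this: standardize the features in a pipeline and then score the predictions on the held-out set. It runs on synthetic data whose formula is made up for illustration, so the printed scores say nothing about our real results.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: value depends on wage and, with a peak near 27, on age
rng = np.random.default_rng(33)
age = rng.integers(17, 40, size=600)
wage = rng.uniform(1_000, 200_000, size=600)
value = 100 * wage - 2e5 * (age - 27) ** 2 + rng.normal(0, 5e5, size=600)

X = np.column_stack([age, wage])
X_train, X_test, y_train, y_test = train_test_split(X, value, test_size=0.25, random_state=33)

# Standardizing first lets Age and Wage contribute comparably to the distance
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 and mean absolute error on data the model has not seen
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:,.0f}")
```

Applied to our data, `X` and `value` would be replaced by the `x` and `y` arrays built from `dataf`, and the two scores would give a number to go with the scatter plots.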
Nevertheless, while this project is far from perfect, we are satisfied with what we've created, and can proudly display a working analysis with many interesting aspects.
This is our final product for our FIFA Player Value Analysis, and thank you for reading.
%%shell
jupyter nbconvert --to html /content/drive/MyDrive/CMPS3160_Project/Project.ipynb
# Change our notebook format to the html format
We changed some of the earlier code in the current project and added missing explanations. We spent a lot of time looking for the right data and found it very difficult to find a suitable dataset. We will keep applying what we learned in class to continue this research.