Predicting football players’ market value using Machine Learning

ulaş özen
12 min read · Jun 3, 2023

You can find the full code on my GitHub account.

I. Introduction

Estimating the market value of football players using machine learning serves several purposes and provides numerous benefits to stakeholders across the football industry:

  • Player evaluation: Clubs, agents and investors need accurate estimates of a player’s market value to make informed decisions during player transfers, contract negotiations and investment strategies. Machine learning models can analyze historical data, player performance metrics and market trends to provide reliable valuations and help clubs identify potential bargains or overpriced players.
  • Transfer market efficiency: Predicting market values increases the efficiency of the transfer market. Clubs can avoid overpaying for players by comparing estimated values with asking prices. It enables clubs to negotiate more effectively and allocate their resources efficiently, leading to a more balanced and competitive transfer market.
  • Risk assessment: Accurate valuation models help assess the risk associated with player investments. By considering factors such as age, performance statistics, injury history and market demand, clubs can assess the potential return on investment. This allows them to reduce financial risks and make strategic decisions when acquiring new players.
  • Player development: Machine learning models can help identify young talent with high growth potential. By analyzing player attributes, performance data and historical patterns, clubs can identify the candidates most likely to increase in value over time. This can inform player development strategies and youth academy investments.
  • Data-driven negotiations: Machine learning-based assessments provide objective evidence during player negotiations. Agents and players can use these predictions to support their demands or set realistic expectations during contract negotiations, leading to fairer and more transparent negotiations.
  • Fantasy sports and betting: Predicting market values can enhance the experience of fantasy football participants and betting enthusiasts. Accurate valuation models can help users make more informed decisions about player selections, team compositions and betting odds, increasing their chances of success.

The application of linear regression and random forest models has demonstrated their effectiveness in estimating the market value of football players. These models take into account a number of relevant characteristics that play an important role in the valuation process and contribute to accurate forecasts.

II. Data Collection and Preparation

The website https://sofifa.com/ is a comprehensive platform that provides a vast amount of data on football players. This data encompasses various attributes, including but not limited to “Age”, “Height”, and “Potential”.

Screenshot from the website showing some of the features

Considering all characteristics and observing their relationship to market value is a prudent approach to developing a comprehensive understanding of the factors that drive player valuation. By analyzing a wide range of attributes and their correlation with market value, you gain a more holistic picture of the dimensions at play, and machine learning techniques such as regression analysis or random forests can then quantify the impact of these attributes and yield models that predict player market values more accurately.

Learning how to scrape data from websites can be really challenging, especially for beginners. This is a skill that requires programming knowledge, an understanding of web technologies, and familiarity with scraping tools and libraries. It is not unusual for the initial learning process to take some time.

Here are a few tips that made my web scraping journey easier:

1. Start with the basics: Familiarize yourself with HTML, CSS, and JavaScript, which form the foundation of web pages. Understanding the structure and elements of a web page will be crucial for targeting and extracting the desired data.

2. Choose the right tools: Various scraping tools and libraries are available in different programming languages, such as Python’s BeautifulSoup, Scrapy or Selenium. Research and choose the tool that suits your programming skills and project requirements. I chose BeautifulSoup for my project.

3. Follow tutorials and documentation: Utilize online tutorials, guides and documentation to learn web scraping concepts and best practices. Many websites and forums provide step-by-step instructions on how to scrape data from different types of websites.

4. Start with simple websites: Begin by scraping data from sites with straightforward HTML structures. This will help you understand the basics and build confidence before tackling more complex websites.

5. Practice and experiment: The more you practice, the better you will get. Experiment with different scraping techniques, selectors, and data extraction methods. Work on small projects to improve your skills and gradually increase the complexity of the websites you scrape.

With perseverance, practice, and continuous learning, you will become increasingly proficient at web scraping and reduce the time it takes to collect data from websites.

It took me a week to learn how to pull data from the website :D. Anyway, I did it, and here is my code. It is currently working (as of 2023).

import time
import requests
import pandas as pd
from bs4 import BeautifulSoup as bts

def getAndParseURL(url):
    # Fetch the page with a browser-like User-Agent and return the parsed HTML
    result = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = bts(result.text, "html.parser")
    return soup

# Build the list of result pages (60 players per page) by varying the offset parameter
base_url = "https://sofifa.com/players?showCol%5B0%5D=pi&showCol%5B1%5D=ae&showCol%5B2%5D=hi&showCol%5B3%5D=wi&showCol%5B4%5D=pf&showCol%5B5%5D=oa&showCol%5B6%5D=pt&showCol%5B7%5D=bo&showCol%5B8%5D=bp&showCol%5B9%5D=gu&showCol%5B10%5D=vl&showCol%5B11%5D=wg&showCol%5B12%5D=rc&showCol%5B13%5D=ta&showCol%5B14%5D=cr&showCol%5B15%5D=fi&showCol%5B16%5D=he&showCol%5B17%5D=sh&showCol%5B18%5D=vo&showCol%5B19%5D=ts&showCol%5B20%5D=dr&showCol%5B21%5D=cu&showCol%5B22%5D=fr&showCol%5B23%5D=lo&showCol%5B24%5D=bl&showCol%5B25%5D=to&showCol%5B26%5D=ac&showCol%5B27%5D=sp&showCol%5B28%5D=ag&showCol%5B29%5D=re&showCol%5B30%5D=ba&showCol%5B31%5D=tp&showCol%5B32%5D=so&showCol%5B33%5D=ju&showCol%5B34%5D=st&showCol%5B35%5D=sr&showCol%5B36%5D=ln&showCol%5B37%5D=te&showCol%5B38%5D=ar&showCol%5B39%5D=in&showCol%5B40%5D=po&showCol%5B41%5D=vi&showCol%5B42%5D=pe&showCol%5B43%5D=cm&showCol%5B44%5D=td&showCol%5B45%5D=ma&showCol%5B46%5D=sa&showCol%5B47%5D=sl&showCol%5B48%5D=tg&showCol%5B49%5D=gd&showCol%5B50%5D=gh&showCol%5B51%5D=gc&showCol%5B52%5D=gp&showCol%5B53%5D=gr&showCol%5B54%5D=tt&showCol%5B55%5D=bs&showCol%5B56%5D=ir&showCol%5B57%5D=pac&showCol%5B58%5D=sho&showCol%5B59%5D=pas&showCol%5B60%5D=dri&showCol%5B61%5D=def&showCol%5B62%5D=phy%3D&offset="
pages = [base_url + str(offset) for offset in range(0, 3060, 60)]

players = []
for page in pages:
    html = getAndParseURL(page)
    table = html.find("table", {"class": "table table-hover persist-area"})
    for row in table.find_all("tr")[1:]:
        cols = row.find_all("td")
        player = {"name": cols[1].get_text().strip()}  # take the player's name and keep it in a dictionary
        for col in cols[2:]:  # loop over all columns after the player's name
            header = table.find_all("th")[cols.index(col)].get_text().strip()  # get the column header
            player[header] = col.get_text().strip()  # add the column value to the player record
        players.append(player)
    time.sleep(1)  # be polite: pause between requests

df = pd.DataFrame(players)  # build the DataFrame

Also, here is the link to all the code.

In the end, I finally got the dataset! Here is the link (an older version of the dataset).

Handling different value formats, such as values ending in “M” (million) or “K” (thousand), is a common challenge when scraping and processing numeric data. You can follow the steps below to convert these values into a consistent format:

1. Determine the value format: As you scrape the data, note how values are represented on the website. Look for patterns or indicators that show whether a value is in millions or thousands (for example, the suffix “M” or “K”).

2. Remove non-numeric characters: Remove non-numeric characters (e.g. “M”, “K”, currency symbols) from the scraped value using string manipulation or regular expressions. This will leave you with the numeric part of the value.

3. Convert to numeric format: Once you have the numeric part of the value, convert it to a numeric format in your programming language. For example, in Python, you can use `float()` or `int()` to convert the string to a float or integer respectively.

4. Scale to the correct magnitude: Depending on the value format, multiply the converted numeric value by the appropriate factor. For values ending in “M”, multiply by 1,000,000 (for example, 25.5 * 1,000,000 = 25,500,000). For values ending in “K”, multiply by 1,000 (for example, 25.5 * 1,000 = 25,500).

By performing these steps consistently, you can convert the values scraped into a unified numeric format, regardless of whether they were originally represented in millions or thousands. This allows values to be more easily analyzed, calculated and compared.
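As an illustration, here is a minimal helper along these lines. The function name parse_value and the exact cleaning regex are my own additions, not taken from the project code:

import re

def parse_value(raw):
    """Convert strings like '€25.5M' or '€800K' into a plain float."""
    cleaned = re.sub(r"[^0-9.KM]", "", raw.upper())  # strip currency symbols and whitespace
    if cleaned.endswith("M"):
        return float(cleaned[:-1]) * 1_000_000
    if cleaned.endswith("K"):
        return float(cleaned[:-1]) * 1_000
    return float(cleaned) if cleaned else 0.0

# For example, applied to the scraped 'Value' column:
# df["Value"] = df["Value"].apply(parse_value)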

When working with scraped data, it is advisable to verify the results and address any unexpected formats or edge cases that may arise during the scraping process. Also, make sure that legal or ethical considerations are followed when scraping data from websites.

There was some more data that needed to be corrected.

All the code can be found at this link.

III. Exploratory Data Analysis (EDA)

I intended to conduct a comprehensive analysis of the data, focusing on examining the highest-valued players and their positions.

The most expensive players

I also created a boxen plot with the seaborn library to visualize the distribution of values.

Figures: overall ratings versus value distribution, the value distribution itself, the distribution of player positions, and a correlation heatmap
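For reference, a boxen plot of this kind takes only a few lines; this is a minimal sketch assuming the cleaned DataFrame is called df and its 'Value' column has already been converted to numbers:

import seaborn as sns
import matplotlib.pyplot as plt

# Minimal sketch: boxen plot of the market value distribution
sns.boxenplot(x=df["Value"])
plt.title("Distribution of player market values")
plt.show()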

Overall, the dataset was fairly clean. There were some outliers, but apart from that it was in good shape.

IV. Feature Engineering

In my research process, I decided to remove certain features from my data set by utilizing my domain knowledge and correlation analyses.

selected_columns = ['name', 'Age', 'Dribbling / Reflexes', 'Passing / Kicking', 'Shooting / Handling', 'International reputation', 'Total mentality', 'Shot power', 'Total power', 'Ball control', 'Finishing', 'Value']

Once I had narrowed down my data, I looked at how they relate to each other.
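A minimal sketch of that check, assuming the narrowed-down data lives in a DataFrame called new_df (the variable name is my assumption, not the project's exact code):

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the selected numeric columns (drop the text 'name' column first)
corr = new_df[selected_columns].drop(columns=["name"]).corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()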

Linear regression seemed to make a lot of sense when I looked at the distributions.

I ran an OLS regression and evaluated the results. My R2 score was quite low, and I had to find a way to raise it.
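For completeness, an OLS fit of this kind can be run with statsmodels; a minimal sketch, assuming X holds the selected features and y the market value (these variable names are my assumption):

import statsmodels.api as sm

# Minimal OLS sketch: add an intercept term and print the regression summary (includes R-squared)
X_ols = sm.add_constant(X)
ols_model = sm.OLS(y, X_ols).fit()
print(ols_model.summary())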

The first thing I noticed was that the market values were not normally distributed. I took the logarithm and re-evaluated the results.

import numpy as np

mask = new_df['Value'] > 0  # boolean mask for non-zero values (log(0) is undefined)
df1['Log Market Value'] = np.log(new_df.loc[mask, 'Value'])  # apply the log only to non-zero values; stored under the column name used later for modeling
Figures: the correlation map, all feature distributions, and the OLS summary with the log-transformed value

As we can see, the R2 score has risen!

Other techniques might help raise the R2 score further, such as one-hot encoding the team names (a quick sketch follows); maybe you can try this yourself :D
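A possible starting point, assuming the scraped data has a team-name column (called 'Team' here purely for illustration):

import pandas as pd

# Hypothetical sketch: one-hot encode a team-name column with pandas
df_encoded = pd.get_dummies(new_df, columns=["Team"], drop_first=True)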

After web scraping, EDA, and feature engineering, it's time to get hands-on with ML.

V. Machine Learning Model Building

In this study, I analyzed my dataset using linear regression and random forest methods and compared these two methods. First, I divided my dataset into features (X) and the value to be predicted (y).

X_train = df1.drop(['name', 'Log Market Value'], axis=1)
y_train = df1['Log Market Value']

I then tried linear regression:

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

Afterwards, I wondered about the coefficients; you can see the outputs in the # comments.

model = reg.fit(X_train, y_train)
model.intercept_  # it returns 5.766291291609278
model.coef_  # it returns:
"""
array([-6.81489648e-02,  6.23469638e-02,  1.84345146e-02,  3.12264893e-02,
        4.18442200e-01,  2.79180750e-04, -2.86286267e-02,  1.55685585e-02,
        3.46522361e-02, -4.43601416e-02])
"""

and the thing I was most curious about is the R2 score;

reg.score(X_train, y_train)
#0.696160575702947

It was 0.69, which was quite low. I could have revisited feature engineering to raise it, or I could try another model; I decided to try another model.

from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor()
forest.fit(X_train, y_train)
forest.score(X_train, y_train)  # R^2 on the training data
#0.9760774006685728

Wow! 0.97 is a really nice score!

Showing the values I obtained from linear regression and random forest in a data frame;

# Assuming the model has been trained and a held-out test set (X_test, y_test) is available
predictions_forest = forest.predict(X_test)
predictions_reg = reg.predict(X_test)

# Create a new DataFrame comparing predictions and actual values
results_df = pd.DataFrame({
    'Predicted Market Value (Random Forest)': predictions_forest,
    'Predicted Market Value (Linear Regression)': predictions_reg,
    'Actual Market Value': y_test
})

I then applied this method to the test dataset.

Actually, that was it, but since this was a self-training exercise, I wanted to apply other techniques I learned from the Hands-On Machine Learning book. This part is a bit extra.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Create a Random Forest regressor
rf = RandomForestRegressor()

# Perform cross-validation
rf_score = cross_val_score(rf, X, y, cv=10, scoring='r2').mean()
rf_score #0.6934855550336881
rf.fit(X,y)

rf_score = 0.69, very similar to linear regression. (The 0.97 above was measured on the training data, so cross-validation gives a more realistic estimate.)

from sklearn.model_selection import train_test_split

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# X_train: Training features
# y_train: Training labels
# X_val: Validation features
# y_val: Validation labels
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Create a Random Forest regressor
rf = RandomForestRegressor()

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10]
}

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='r2', cv=5)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

# Access the best model
best_model = grid_search.best_estimator_

# Evaluate the best model
best_model_score = best_model.score(X_val, y_val)
print("Best Model Score:", best_model_score)

# Best Model Score: 0.8241750987612341

# Make predictions using the best model (assuming a held-out test set X_test)
predictions = best_model.predict(X_test)
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Evaluate the predictions using mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Evaluate the predictions using R^2 score
r2 = r2_score(y_test, y_pred)
print("R^2 Score:", r2)

#Mean Squared Error: 0.33293666406282296
#R^2 Score: 0.8241750987612341

As can be seen, hyperparameter tuning with GridSearchCV improved the R2 score!

Afterward, I wanted to visualize how much each feature contributed.

import numpy as np
import matplotlib.pyplot as plt

# Get feature importances
importances = best_model.feature_importances_

# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.title("Feature Importance")
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), X_train.columns[indices], rotation=90)
plt.tight_layout()
plt.show()

Finally, we come to the last step: randomly selecting a player and seeing whether our model can hit the market value.

I chose the person with index number 1000, which is E. Bardhi.

I entered the player’s characteristics in my model in dictionary format.
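Roughly, that step looks like this (a sketch of my own, not the exact project code; it pulls the features straight from row 1000 of the prepared DataFrame rather than typing out the dictionary by hand):

# Take the features of the player at index 1000 (E. Bardhi) and compare prediction vs. actual
player_features = df1.drop(['name', 'Log Market Value'], axis=1).iloc[[1000]]
actual_log_value = df1['Log Market Value'].iloc[1000]
predicted_log_value = best_model.predict(player_features)[0]
print(f"Actual: {actual_log_value:.2f}  Predicted: {predicted_log_value:.2f}")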

While the log-transformed market value of the player was 15.34, my model estimated it as 15.65, which is pretty close :D
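For reference, np.exp maps these log values back to the original scale; a quick check of my own (not from the original write-up):

import numpy as np

print(np.exp(15.34))  # ≈ 4.6 million (actual)
print(np.exp(15.65))  # ≈ 6.3 million (predicted)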

You can also enter your own characteristics and see how much your market value would be if you were a football player 😆

VI. Conclusion and Future Work

- In this study, I predicted the market value of football players from a set of specific features.

- It turns out that “ball control” matters more than the other features, according to the feature importance analysis.

- To take this project further, more attention could be paid to the feature engineering steps: for example, as mentioned earlier, one-hot encoding could be applied to the teams, or mathematical operations could be used to strengthen the correlation between some features, such as averaging two columns. As a further future project, this model could be published as a website with a simple interface that lets users, such as coaches, see in advance how much market value their players will have, entirely online.

VII. Acknowledgments

While doing this project, I got help from many internet resources; in particular, I applied what I learned from the Hands-On Machine Learning book. I also looked at a lot of similar work, noted what other people did differently from me, and I believe I have produced the best work I can at my level.

VIII. Thanks!

Thank you very much for reading my article. I continue to improve myself day by day in the field of machine learning. Of course, there may be mistakes or confusing explanations, and I keep working on them to bring out the best version of myself.

Big thanks again! 💙
