As part of my data-science career track bootcamp, I had to complete a few personal capstones. For this particular capstone, I opted to focus on building something I personally care about – what better way to learn, and possibly build something valuable, than by working on a passion project?

Additionally, I believe too much time is spent trying to impress other people, and life is too short to not work on things you personally find interesting. In light of this, I decided to build a boxing prediction web app.

One that would show would-be users the probability of a given outcome in a fight. To replicate the reality of boxing, where people in different weight classes cannot fight each other, the fight in question has to be between fighters in the same weight class.

In my previous article I wrote about the process of acquiring most of the data necessary to train and build the model to carry out these predictions. I also got into the seemingly laborious but quite interesting process of cleaning up the data, presenting it in the format of an interactive dashboard and deploying it on Heroku.

The full article is available on Hackernoon. I also chose to build a custom top ten ranking for each division based on wins and, to a smaller extent, the caliber of opponents a given boxer beat. I did not, however, use an Elo-style rating system, which in all fairness is probably a significantly more realistic representation of the boxing landscape.

More cleanup and transformation

For the purpose of building the model, I had to transform the data I already had in such a way that I would have a row for each bout, along with the relevant stats for Boxer A and the opponent. This process involved using list comprehensions to create a list of all the column names relevant to a specific stat or opponent.

cols = ['secondBoxer' + str(i) for i in range(1, 85)]
two = ['secondBoxerWeight' + str(i) for i in range(1, 85)]

For each list of columns related to a bout stat, I unpivoted the columns using the boxer’s name, sex, division and global id as the identifier variables, then concatenated these dataframes into one. The end result was a ‘long dataframe’ in which each row shows data related to each fight a given fighter fought, along with the opponent’s stats and stats relevant to that particular bout.

concated = pd.melt(data, id_vars=['name', 'global_id', 'sex', 'division'],
                   value_vars=cols, var_name='label', value_name='opposition')
concated_two = pd.melt(data, id_vars=['name', 'global_id', 'sex', 'division'],
                       value_vars=two, var_name='weightb_label',
                       value_name='opp_weight').drop(columns=['global_id', 'name', 'sex', 'division'])
# the column lists `three`, `four`, ... are built the same way as `cols` and `two`
concated_three = pd.melt(data, id_vars=['name', 'global_id', 'sex', 'division'],
                         value_vars=three, var_name='last6_label',
                         value_name='opp_last6').drop(columns=['global_id', 'name', 'sex', 'division'])
... # merge all the melted frames
fully_merged = pd.concat([concated, concated_two, concated_three,
                          concated_four, concated_five, ...], axis=1)
fully_merged = fully_merged.set_index('name')

A lot of sklearn models require features to be numerical as opposed to categorical, so I preemptively converted all categorical variables to numerical ones. In one column, for example, I had the opponent’s results from their previous 6 matches. In such scenarios I searched the string for each outcome, scoring 10 points for each occurrence of the word win, 5 for each draw and -5 for each loss, giving me a numerical score as a representation of the opponent’s last 6 bouts.

# converting last 6 fights to points
fully_merged['opp_last6'] = (fully_merged.opp_last6.str.count('win') * 10
                             + fully_merged.opp_last6.str.count('draw') * 5
                             + fully_merged.opp_last6.str.count('loss') * -5)

One of the columns I had contained a list of lists, with each nested list containing the judges’ scores for both the boxer and the opponent. Cleaning this specific column involved extracting all numerical values, un-stacking the column into 6 columns for each score, renaming the multi-level columns created and then assigning the columns to the ‘main dataframe’. I followed the same logic in unpacking the number of rounds won for each boxer.

ref_points = fully_merged.judge.str.extractall(r'(\b\d+\b)').unstack().reindex(fully_merged.index)
ref_points.columns = ref_points.columns.map('{0[0]}_{0[1]}'.format)
fully_merged[['judge1boxer', 'judge1opp', 'judge2boxer', 'judge2opp', 'judge3boxer', 'judge3opp']] = ref_points[['0_0', '0_1', '0_2', '0_3', '0_4', '0_5']]

While it’s useful to have stats for each boxer and potentially use those as features for my model, I also wanted to get the same attributes and stats for each opponent. In Excel there is a function referred to as a VLOOKUP, whose purpose is to retrieve a value from a specific column. For example, we can look for the name of a person, or a string that resembles this name, and then return any value associated with that person’s name in other columns. Using similar logic, I used the pandas map method to ‘map’ the opponent’s name to the name column. For each match I retrieved the weight, height and other stats associated with the matching name.
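A minimal sketch of that map-based lookup (the column names and values here are made up for illustration, not my actual dataset):

```python
import pandas as pd

# Toy frame standing in for the real dataset
boxers = pd.DataFrame({
    'name': ['Boxer A', 'Boxer B', 'Boxer C'],
    'height': [178, 183, 170],
    'opposition': ['Boxer B', 'Boxer C', 'Boxer A'],
})

# Build a name -> height lookup, then "vlookup" each opponent's height
height_lookup = boxers.set_index('name')['height']
boxers['opp_height'] = boxers['opposition'].map(height_lookup)
```

The same lookup series can be reused for weight, reach and any other per-boxer stat.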

Enriching my dataset with more data

While I already had quite a fair amount of data to work with, as an avid boxing fan I knew that there was more data out there that could potentially help me build a better model. The number of punches thrown and landed in a given fight could be aggregated into interesting features showing both the overall accuracy of a given boxer and how well the boxer’s defense typically holds up against other opponents, judged by the percentage of punches landed against said boxer.
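As a sketch of the kind of aggregation I mean (column names and numbers below are invented for illustration): offensive accuracy is the share of a boxer's own punches that land, and the defensive feature is the share of opponents' punches that get through.

```python
import pandas as pd

# Hypothetical per-fight punch totals, CompuBox-style
stats = pd.DataFrame({
    'fighter': ['A', 'A', 'B'],
    'thrown': [500, 400, 600],            # punches the fighter threw
    'landed': [200, 180, 150],            # punches the fighter landed
    'absorbed_thrown': [450, 380, 500],   # punches thrown at the fighter
    'absorbed_landed': [90, 95, 200],     # punches that got through
})

per_fighter = stats.groupby('fighter').sum()
# Offensive accuracy: share of own punches that land
per_fighter['accuracy'] = per_fighter['landed'] / per_fighter['thrown']
# Defensive rate: share of opponents' punches that connect
per_fighter['opp_connect_rate'] = per_fighter['absorbed_landed'] / per_fighter['absorbed_thrown']
```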

What made this process a little tricky at first is how to get the punch stats for each boxer. I initially envisioned simulating clicks on the view/download button that reveals the boxing stats for each fight. To do this I experimented with selenium: inputting a given boxer’s full name, simulating clicks through the view/download stats button, and reading the text with the punch stats for each bout a given boxer fought. However, I eventually opted to use the requests library to retrieve data from the secondary web address with the punch stats for each fighter, which proved to be a more feasible option.

I used the requests library to firstly create a list of id’s for each boxer available in the data source and then iterate through the list and return, for each boxer id, the fights recorded along with the punch stats for each fight. While I could have broken the punch stats by round, I chose to focus on the totals; the total jabs, power punches and overall punches thrown and landed in each fight.

# get punch stats per fight
def punch_stats(df):
    final_rounds_df = pd.DataFrame()  # per-round stats (handling elided)
    final_df = pd.DataFrame()
    stats_pattern = re.compile(r'\d+\.?\d?(?=%)|\d+\/\d+')
    for index, row in df.iterrows():
        # create the data/parameters for each request
        dataload = {"event_id": row['event_id'],
                    "fighter1_id": row['fighter1id'],
                    "fighter2_id": row['fighter2id'],
                    "fighter1_name": row['fighter1ln'],
                    "fighter2_name": row['fighter2ln']}
        ... # send the request to the stats endpoint, keeping the response as r
        # scrape all the round data from the response
        stats = re.findall(stats_pattern, r.text)
        slice1 = []
        for no in range(78):
            ... # build the list of slice sizes for the round-level stats
        data_input = iter(stats)
        stats = [list(islice(data_input, elem)) for elem in slice1]
        slice2 = [12, 12, 12, 12, 12, 12, 3, 3]
        input2 = iter(stats)
        stats = [list(islice(input2, elem)) for elem in slice2]
        # final punch stats (the last two groups hold each fighter's totals)
        for idx, fighter in enumerate(stats[-2:]):
            total_df = pd.DataFrame(fighter)
            # add the event_id
            total_df['event_id'] = row['event_id']
            # fighter's name
            if idx % 2 == 0:
                total_df['fighter'] = row['fighter1ln']
            else:
                total_df['fighter'] = row['fighter2ln']
            # add stat titles
            total_df['punch_stat'] = ['Total Punches', 'Jabs', 'Power Punches']
            # append dataframes to the corresponding dataframes
            final_df = final_df.append(total_df)
    # renaming columns
    final_df.rename(columns={0: 'punches', 1: 'pct_landed'}, inplace=True)
    ... # dropping duplicates
    return final_df

For brevity’s sake I did not include the entire code I wrote, and replaced some parts with ellipses.

As always, this step was followed by a cleanup and transformation process. This involved a mixture of pivoting the data from long to wide to easily aggregate punch stats for each boxer, and dropping columns with information that would not be useful for either my model or the interactive dashboard.
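A minimal sketch of that long-to-wide pivot, assuming columns shaped like the output of the scraping step above (the values themselves are made up):

```python
import pandas as pd

# Long-format punch stats: one row per (fighter, stat) pair
long_df = pd.DataFrame({
    'fighter': ['A', 'A', 'A', 'B', 'B', 'B'],
    'punch_stat': ['Total Punches', 'Jabs', 'Power Punches'] * 2,
    'punches': ['200/500', '80/250', '120/250', '150/600', '60/300', '90/300'],
})

# Pivot long -> wide: one row per fighter, one column per punch stat
wide = long_df.pivot(index='fighter', columns='punch_stat', values='punches')
```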

Rather unfortunately, after merging this with my data, I found that CompuBox had punch stats for only around 16-20% of the boxers in my dataset, reducing the overall impact these stats would have on my model. In order to build a more visually appealing web app, I also decided to use the beautifulsoup library to scrape pictures of all the boxers in my dataset. The idea was that if a user selected a boxer, the picture of that boxer, assuming it was available, would be shown right below the user’s selection.

path = ''
file = pd.read_csv(path)
for index, row in file.iterrows():
    site = row['players_links']
    response = requests.get(site)
    soup = BeautifulSoup(response.text, 'html.parser')
    try:
        pics = soup.find('img')
        pic_url = pics['src']
        urllib.request.urlretrieve(pic_url, 'C:\\Users\\User\\Documents\\GitHub\\SpringboardCapstoneBoxingPredictionWebApp\\pictures\\' + str(site.split('/')[-1]) + '.jpg')
    except Exception:
        image = ''

From inspecting the elements of each profile, I noticed that only the profile picture was encapsulated in the img tag. Using this knowledge, I iterated through the dataset and, for each URL link to a boxer’s profile, found the URL within the ‘img’ tag and then retrieved the picture from the extracted URL, using try and except to capture any exceptions that might occur, for example a case where there is no profile picture on a given page.

Building the model

In building the actual model to carry out the predictions, I initially opted to use a Random Forest Classifier. To explain it simply, think about how humans typically make decisions. Let’s use the example of deciding to go to the gym. We can form a decision tree to emulate this process. I first ask myself if my body is still tired from my previous workouts. This initial question branches into two nodes: Yes and No. Assuming I am tired, I would then ask myself if my body is genuinely tired or if I simply do not feel like going to the gym. Two branches would then stem out of this node. As this process continues on all nodes, each branch would ultimately generate its own conclusion/end result. I could either opt to go to the gym, choose to work out indoors at home, choose to carry out some other form of physical exercise or choose to rest. We can think of each node as a feature or a combination of features.

A Random Forest is essentially multiple decision trees. However, with each tree an element of randomness is introduced: each tree takes a random set of features and samples with replacement (for each item sampled, the item drawn is returned to the dataset before the next sample is drawn). The model then makes a prediction based on the majority prediction derived from the trees.

However, long-time boxing fans can attest to the fact that draws in boxing are few and far between, and this was evident in my data. I was therefore dealing with highly imbalanced classes, with the word classes referring to possible outcomes in a given fight. As a result, evaluating my confusion matrix revealed significantly lower accuracy when it came to predicting draws. In order to ‘balance’ these predictions out, I decided to assign individual weights to each class in my dataset, assigning a significantly greater weight to draws. Through this process, along with hyperparameter tuning using GridSearch and filtering for the most important features, I was eventually able to push the accuracy of my draw predictions up to a little over 50%.
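The class-weighting and grid-search combination can be sketched with scikit-learn as follows; the synthetic dataset, the specific weights and the grid values here are illustrative, not the exact ones I used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the fight data: three classes (win/loss/draw),
# with draws (class 2) deliberately rare
X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           weights=[0.48, 0.48, 0.04], random_state=0)

# Up-weight the rare draw class so the forest does not ignore it
rf = RandomForestClassifier(class_weight={0: 1, 1: 1, 2: 10}, random_state=0)

# Light hyperparameter tuning with grid search
grid = GridSearchCV(rf, {'n_estimators': [50, 100], 'max_depth': [5, None]}, cv=3)
grid.fit(X, y)
```

An alternative to a hand-picked weight dict is `class_weight='balanced'`, which scales weights inversely to class frequencies.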

After further research and conversations with my mentor, I opted to test out the categorical boosting model, or ‘CatBoost’. Interestingly, this model can handle both categorical and numerical values without requiring the user to convert categorical features to numerical values in the pre-processing stage. Much like other gradient-boosting algorithms, CatBoost implements decision trees, but it appears to reduce overfitting and requires much less parameter tuning than, say, XGBoost. There are articles that better explain how this model works, such as the one I have attached in this article.

Through CatBoost I was able to build a model with much better performance than my Random Forest. Assessing my confusion matrix revealed an accuracy of 70-77% for wins and losses and 73% for draws.

Sharing the model

Now let’s assume hypothetical Tommy is at home about to watch Wilder v Fury 2, along with all the undercard fights, on the 22nd of February 2020. Out of curiosity he wants to use this model to predict the outcome of a few fights on that night. Will the hard-hitting Deontay Wilder add Tyson Fury to his highlight reel by knocking him out as easily as he did Bermane Stiverne in their rematch? Or will Fury be on fire and leave Wilder’s defence terrified, not only retaining his lineal ‘title’ but acquiring the WBC heavyweight championship? In order to facilitate this process, I decided to build a web app that would use the model I built to generate probabilities of a fight ending in a particular way.

To reuse the CatBoost model in my Shiny app, I saved it as a pickle file. Since I would essentially need to call Python code in R, I needed the reticulate package, which allows for interoperability between Python and R by letting me interface with Python. The process of building and deploying this app can be divided into three steps. The first was setting up a virtual environment with the desired Python version for the app, and explicitly installing the Python libraries I would need:

virtualenv_create(envname = "python_environment")
virtualenv_install("python_environment", packages =c("pandas","catboost"))
use_virtualenv("python_environment",required = TRUE)
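On the Python side, persisting the model is plain pickle usage; in the app, reticulate loads it back the same way. This is a minimal sketch with a stand-in object and an illustrative file name rather than my actual model:

```python
import pickle

# Stand-in for the trained CatBoost model object
model = {'note': 'trained model would go here'}

# Save the model so the Shiny app can load it later
with open('boxing_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# In the app (called from R via reticulate), load it back
with open('boxing_model.pkl', 'rb') as f:
    loaded = pickle.load(f)
```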

The second step involved building the user interface of the app. This can either be created as a separate file or as a function inside a single-file app. Here I set the background image, added styling tags, created the output tags that would react to user inputs, and added dropdown menus to allow users to filter boxers according to their weight class:

fluidRow(column(offset = 5, width = 2, align = "center",
                titlePanel(h5(selectInput("dropdown", "Select Boxer Weights",
                                          choices = unique(boxing$division)))))),
fluidRow(column(offset = 3, width = 3,

Lastly, I created the server function that defines the logic of the app. For example, it defines the process of filtering the boxer dataset, so that the names shown to the user depend on the weight class selected in the dropdown, and of rendering images matching the IDs associated with the boxers selected.

# filter opponent names based on selection
  output$Opponent <- renderUI({
    df <- boxing %>% filter(division %in% input$dropdown)
    selectInput("names2", "Opponent", choices = df$name)
  })

Unfortunately, there are current limitations with the data I used. Boxing fans will notice that quite a few big names are missing. However, I will be updating my data regularly and should have more boxers included very soon. For those interested, the app is available here – . Feel free to send any feedback or advice to help me improve my app on my Twitter at Emmoemm.

This article was originally posted on Hackernoon

Categories: Data Science