Back to jeromewilliams.net

Predicting BGG ratings from boardgame characteristics¶

Jerome Williams
2024 March 9

This notebook investigates how well average boardgame ratings from BoardGameGeek (BGG) can be predicted from boardgame metadata (number of players, playing time, mechanics, etc.).

We will consider the prediction question from the point of view of a hypothetical boardgame publisher, trying to predict whether the game will be a hit or not in advance of release. Thus, we will rely only on intrinsic characteristics of the game, such as the number of players, playing time, themes, mechanics, and designers, and will not consider as possible predictors any "post-release" data, such as the number of ratings on BGG or the game's publisher.

Preamble¶

Let's first load some modules we will need.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

input_path = 'input/bggdata/'


plt.rcParams["font.family"] = "Arial"
plt.rcParams['grid.linestyle'] = 'solid'
plt.rcParams['axes.labelcolor'] = '0.2'
plt.rcParams['grid.color'] = '0.9'
plt.rcParams['xtick.color'] = '0.3'
plt.rcParams['ytick.color'] = '0.3'
plt.rcParams['axes.edgecolor'] = '0.3'
plt.rcParams['axes.titleweight']  = 'bold'
plt.rcParams['axes.facecolor'] = 'white'

Data¶

We are using the BGG data from boardgamegeek.com, which I downloaded from Jen Watkins' page on Kaggle.

The dataset has several tables, described below.

Filename Description
artists_reduced.csv Artists associated with each game, stored in one-hot format.
designers_reduced.csv Designers associated with each game, stored in one-hot format.
games.csv Tabular data for each boardgame, including year published, minimum and maximum number of players, and average rating.
mechanics.csv Game mechanics associated with each boardgame, stored in one-hot format.
publishers_reduced.csv Publishers associated with each game, stored in one-hot format.
ratings_distribution.csv The distribution of ratings for each game, stored as, for each numerical rating (e.g., 7.5), the number of times the boardgame has received that rating.
subcategories.csv Subcategories associated with each game. Example subcategories are "Exploration," "Miniatures," "Territory Building," and "Card Game." Categories, as opposed to subcategories, are stored in games.csv.
themes.csv Themes associated with each game. Example themes are "World War I", "Humor", and "Traffic".
user_ratings.csv This seems to be a complete set of ratings data. Each row is a boardgame, username, and numerical rating.

For this analysis, we will use games.csv, mechanics.csv, designers_reduced.csv, subcategories.csv and themes.csv.

In [2]:
games = pd.read_csv(input_path + 'games.csv')
mechanics = pd.read_csv(input_path + 'mechanics.csv')
subcategories = pd.read_csv(input_path + 'subcategories.csv')
designers = pd.read_csv(input_path + 'designers_reduced.csv')
themes = pd.read_csv(input_path + 'themes.csv')

Initial exploration¶

Let's first split our dataset into training and test datasets. The plots that follow will be based on the training data only.

In [3]:
train = games.sample(frac=0.8, random_state=12345)
test = games.drop(train.index)

print(f"Total: {games.shape}")
print(f"Train: {train.shape}")
print(f"Test: {test.shape}")
Total: (21925, 48)
Train: (17540, 48)
Test: (4385, 48)

Our dataset has 21,925 games. We create an 80%/20% train/test split, which results in a training set of 17,540 observations and a test set of 4,385 observations.

Rating data¶

First, let's look at the histogram of average ratings.

Our dataset includes both the "Average Rating" (raw average of user ratings, AvgRating in games.csv) and the “BGG Rating", a modified version of "Average Rating". The BGG Rating (BayesAvgRating in games.csv) modifies "Average Rating" by adding some number of dummy ratings with scores of 5.5 (out of 10) to the set of ratings, which has the effect of pulling ratings--especially those for games with few votes--closer to the middle (BoardGameGeek).

According to BoardGameGeek, the BGG Rating is used on the site to “prevent games with relatively few votes climbing to the top of the BGG Ranks” (BoardGameGeek).

Let's look at the distribution of AvgRating and BayesAvgRating.

In [4]:
plot_data = train.melt(id_vars = ['BGGId'],value_vars=['AvgRating', 'BayesAvgRating'], var_name='Rating Type', value_name='Rating')
ax = sns.kdeplot(data=plot_data, x="Rating", hue="Rating Type", fill=True)
# plt.show()

fig = ax.get_figure()

ax.grid(True)
ax.set_axisbelow(True)


# title
fig.text(0.5, 1, "BGG Boardgame Ratings", ha='center', weight='bold', size = 16)
# subtitle
fig.text(0.5, 0.94, "Distribution of Ratings", ha='center', weight='normal', size = 16)
# caption
caption_str = f"Source: BoardGameGeek.com; Jerome Williams\nNote: Plot based on training dataset of {len(train)} games."
fig.text(0.05, -0.1, caption_str, ha='left', size = 10)

ax.set_xlabel('Rating')
ax.set_ylabel('Density')

plt.savefig("output/03_avg_vs_bgg_rating_distn.pdf", format="pdf", bbox_inches = 'tight')
No description has been provided for this image

Boardgames with at least 1,000 reviews¶

If we filter to games with at least 1,000 reviews, the two distributions should be more similar.

In [5]:
plot_data = (train[train['NumUserRatings'] > 1000]
             .melt(id_vars = ['BGGId'],
                   value_vars=['AvgRating', 'BayesAvgRating'],
                   var_name='RatingType', 
                   value_name='Rating')
                   )
ax = sns.kdeplot(plot_data, x="Rating", hue="RatingType", fill=True)
fig = ax.get_figure()

ax.grid(True)
ax.set_axisbelow(True)

# title
fig.text(0.5, 1, "BGG Boardgame Ratings", ha='center', weight='bold', size = 16)
# subtitle
fig.text(0.5, 0.94, "Distribution of Ratings, Games with at least 1000 ratings", ha='center', weight='normal', size = 16)
# caption
caption_str = f"Source: BoardGameGeek.com; Jerome Williams\nNote: Plot based on training dataset of {len(train)} games, \
filtered to games with at least 1,000 ratings."
fig.text(0.05, -0.1, caption_str, ha='left', size = 10)

ax.set_xlabel('Rating')
ax.set_ylabel('Density')

plt.savefig("output/04_avg_vs_bgg_rating_distn_1000.pdf", format="pdf", bbox_inches = 'tight')
No description has been provided for this image

Although it is noiser, in this exercise, we will try to predict the "Average Rating" rather than the "BGG Rating".

Feature engineering¶

Now that we've examined the target variable, let's look at some of the potential feature variables.

Mechanics¶

First, let's examine the game mechanics variables in mechanics.csv.

Before anything else, let's join the mechanics data to our training and test set, and focus our review on the training set only, so that we don't contaminate the analysis.

In [6]:
mechanics = pd.read_csv('input/bggdata/mechanics.csv')
train = train.merge(mechanics, on = 'BGGId')
test = test.merge(mechanics, on = 'BGGId')
print(f"After joining mechanics, train: {train.shape}")
print(f"After joining mechanics, test: {test.shape}")
After joining mechanics, train: (17540, 205)
After joining mechanics, test: (4385, 205)

The mechanics data is one-hot encoded, with a column for each mechanic and a row for each game, and a 1 or 0 in each cell indicating whether a certain game uses that mechanic. Thus, we can find the total number of games with each mechanic by summing over all rows.

In [7]:
mechanics_list = mechanics.columns[1:].to_list()

mechanics_sum = train[mechanics_list].sum(axis=0).sort_values(ascending=False)
print('Top 10 mechanics:')
print(mechanics_sum[:10])

print(f'Number of unique mechanics: {len(mechanics_list)}')
Top 10 mechanics:
Dice Rolling              5207
Hand Management           3596
Set Collection            2352
Variable Player Powers    2172
Hexagon Grid              1975
Simulation                1725
Drafting                  1612
Tile Placement            1458
Modular Board             1355
Grid Movement             1310
dtype: int64
Number of unique mechanics: 157

We have 157 different mechanics. The most common mechanic is 'Dice Rolling', which is used by 5,207 games (in the training set).

Subcategories¶

Next, let's join the subcategories data. Let's add a prefix of "subcat_" to the column names so that there are no naming conflicts.

In [8]:
subcategories = pd.read_csv('input/bggdata/subcategories.csv')
subcategories_new_columns = ['subcat_' + col for col in subcategories.columns[1:].to_list()]
subcategories.columns = ['BGGId'] + subcategories_new_columns
print(len(subcategories_new_columns))
print(subcategories.columns)

train = train.merge(subcategories, on = 'BGGId')
test = test.merge(subcategories, on = 'BGGId')

print(train.shape, test.shape)
10
Index(['BGGId', 'subcat_Exploration', 'subcat_Miniatures',
       'subcat_Territory Building', 'subcat_Card Game', 'subcat_Educational',
       'subcat_Puzzle', 'subcat_Collectible Components', 'subcat_Word Game',
       'subcat_Print & Play', 'subcat_Electronic'],
      dtype='object')
(17540, 215) (4385, 215)

We have 10 different subcategories: Exploration, Miniatures, Terrritory Building, Card Game, Educational, Puzzle, Collectible components, Word Game, Print & Play, and Electronic.

Designers¶

Next, let's join the designers data.

In [9]:
designers = pd.read_csv('input/bggdata/designers_reduced.csv')
designers_columns = designers.columns.to_list()

# print(designers_columns.index('BGGId'))
# print(designers_columns.index('Low-Exp Designer'))

designers_new_columns = ['designer_' + col for col in designers.columns[:1592].to_list()]
designers.columns = designers_new_columns + ['BGGId', 'Low-Exp Designer']

train = train.merge(designers, on = 'BGGId')
test = test.merge(designers, on = 'BGGId')

print(train.shape, test.shape)
(17540, 1808) (4385, 1808)
In [10]:
designers_sum = train[designers_new_columns + ['Low-Exp Designer']].sum(axis=0).sort_values(ascending=False)
print('Top 10 designers:')
print(designers_sum[:10])
print(f'Number of unique designers: {len(designers_new_columns)}')
Top 10 designers:
Low-Exp Designer            6652
designer_(Uncredited)       1147
designer_Reiner Knizia       260
designer_Joseph Miranda      113
designer_Wolfgang Kramer     103
designer_Richard H. Berg      89
designer_Jim Dunnigan         81
designer_James Ernest         80
designer_Martin Wallace       75
designer_Frank Chadwick       74
dtype: int64
Number of unique designers: 1592

Our designers data does not include data on every designer on BGG. Instead, designers with 3 or fewer entries are lumped together as "Low-Exp Designer". 6,652 have a Low-Exp Designer among their designers. An additional 1,147 have "(Uncredited)" listed as their designer. Aside from these entry types, the most common designers in the training data are Reiner Knizia, Joseph Miranda, and Wolfgang Kramer.

Themes¶

Lastly, let's join the themes data.

In [11]:
themes = pd.read_csv('input/bggdata/themes.csv')
themes_new_columns = ['theme_' + col for col in themes.columns[1:].to_list()]
themes.columns = ['BGGId'] + themes_new_columns

train = train.merge(themes, on = 'BGGId')
test = test.merge(themes, on = 'BGGId')
print(train.shape, test.shape)
(17540, 2025) (4385, 2025)
In [12]:
designers_sum = train[themes_new_columns].sum(axis=0).sort_values(ascending=False)
print('Top 10 themes:')
print(designers_sum[:10])
print(f'Number of unique themes: {len(themes_new_columns)}')
Top 10 themes:
theme_Fantasy                      2110
theme_Science Fiction              1332
theme_Fighting                     1330
theme_Economic                     1221
theme_Animals                      1061
theme_World War II                  996
theme_Humor                         963
theme_Adventure                     924
theme_Movies / TV / Radio theme     859
theme_Medieval                      824
dtype: int64
Number of unique themes: 217

Next let's collect our features. We have 5 player count and playtime variables, 157 mechanics, 8 categories, 10 subcategories, 217 themes, and 1593 designers.

In [13]:
feature_vars = (['MinPlayers', 'MaxPlayers', 'MfgPlaytime', 'ComMinPlaytime', 'ComMaxPlaytime', 
                'Cat:Thematic',
                'Cat:Strategy',
                'Cat:War',
                'Cat:Family',
                'Cat:CGS',
                'Cat:Abstract',
                'Cat:Party',
                'Cat:Childrens'] + 
            mechanics_list + 
            subcategories_new_columns + 
            designers_new_columns + ['Low-Exp Designer'] +
            themes_new_columns)
X_train = train[feature_vars].copy()
X_test = test[feature_vars].copy()

target_var = 'AvgRating'
y_train = train[target_var]
y_test = test[target_var]
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape) 
# print(X_train.head(5))
(17540, 1990) (4385, 1990) (17540,) (4385,)

Playing time and player count¶

Let's also create some different representations of the player count variables. Our data records the minimum and maximum player counts, but the model may do better with explicit variables indicating whether a game can be played 2/3/4/etc. players.

In [14]:
X_train['plays1'] = ((X_train['MinPlayers'] == 1) & (X_train['MaxPlayers'] >= 1)).astype(int)
X_train['plays2'] = ((X_train['MinPlayers'] <= 2) & (X_train['MaxPlayers'] >= 2)).astype(int)
X_train['plays3'] = ((X_train['MinPlayers'] <= 3) & (X_train['MaxPlayers'] >= 3)).astype(int)
X_train['plays4'] = ((X_train['MinPlayers'] <= 4) & (X_train['MaxPlayers'] >= 4)).astype(int)
X_train['plays5'] = ((X_train['MinPlayers'] <= 5) & (X_train['MaxPlayers'] >= 5)).astype(int)
X_train['plays6'] = ((X_train['MinPlayers'] <= 6) & (X_train['MaxPlayers'] >= 6)).astype(int)
X_train['plays7plus'] = (X_train['MaxPlayers'] >= 7).astype(int)
X_train['num_player_counts'] = X_train['MaxPlayers'] - X_train['MinPlayers'] + 1

X_test['plays1'] = ((X_test['MinPlayers'] == 1) & (X_test['MaxPlayers'] >= 1)).astype(int)
X_test['plays2'] = ((X_test['MinPlayers'] <= 2) & (X_test['MaxPlayers'] >= 2)).astype(int)
X_test['plays3'] = ((X_test['MinPlayers'] <= 3) & (X_test['MaxPlayers'] >= 3)).astype(int)
X_test['plays4'] = ((X_test['MinPlayers'] <= 4) & (X_test['MaxPlayers'] >= 4)).astype(int)
X_test['plays5'] = ((X_test['MinPlayers'] <= 5) & (X_test['MaxPlayers'] >= 5)).astype(int)
X_test['plays6'] = ((X_test['MinPlayers'] <= 6) & (X_test['MaxPlayers'] >= 6)).astype(int)
X_test['plays7plus'] = (X_test['MaxPlayers'] >= 7).astype(int)
X_test['num_player_counts'] = X_test['MaxPlayers'] - X_test['MinPlayers'] + 1

print(X_train.shape, X_test.shape)
(17540, 1998) (4385, 1998)

Variable scaling¶

Before fitting our model, let's scale the features and target variables.

In [15]:
# scaling
from sklearn.preprocessing import RobustScaler

X_scaler = RobustScaler()

features = X_train.columns

Xtr2 = X_scaler.fit_transform(X_train[features])
Xte2 = X_scaler.transform(X_test[features])

y_scaler = RobustScaler()

ytr2 = y_scaler.fit_transform(y_train.values.reshape(-1, 1))
yte2 = y_scaler.transform(y_test.values.reshape(-1, 1))

Next, we setup the training loop. We create a cross-validation split with 5 folds. We also define an objective function for our parameter tuning. The objective function takes the current parameters as arguments, fits the model, and fits the mean score (in this case, RMSE) across the 5 folds.

We use optuna to tune the parameters (learning rate and number of trees) and use a TPE (Tree-structured Parzen Estimator) sampler (arxiv.org).

In [16]:
import optuna
from optuna.samplers import TPESampler
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

n = 5
cv = KFold(n, shuffle = True, random_state = 12345)

def objective(trial):
    
    learning_rate = trial.suggest_float("learning_rate", 0, 0.1)
    n_estimators = trial.suggest_int('n_estimators', 500, 2000)

    scores = []
    
    for i, (train_index, val_index) in enumerate(cv.split(Xtr2, ytr2)):
        Xtr2_fold, X_val = Xtr2[train_index], Xtr2[val_index]
        ytr2_fold, y_val = ytr2[train_index], ytr2[val_index]
    
        xgb_model = xgb.XGBRegressor(objective = "reg:squarederror",
                                     verbosity = 0,
                                     learning_rate = learning_rate,
                                     n_estimators = n_estimators
                                    )
        
        xgb_model.fit(Xtr2_fold, ytr2_fold)
    
        xgb_preds = xgb_model.predict(X_val)
        score = mean_squared_error(y_val, xgb_preds)
    
        scores.append(score)
    
    return np.mean(scores)

study = optuna.create_study(direction = "minimize", sampler = TPESampler())
study.optimize(func = objective, n_trials = 30)
print(study.best_params)
/Users/jeromew/.pyenv/versions/3.9.1/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
[I 2024-03-19 22:44:06,244] A new study created in memory with name: no-name-aaabedc9-46b1-4d96-8380-55cbd091e661
[I 2024-03-19 22:45:04,341] Trial 0 finished with value: 0.30738036954574566 and parameters: {'learning_rate': 0.060833069383555406, 'n_estimators': 1880}. Best is trial 0 with value: 0.30738036954574566.
[I 2024-03-19 22:45:30,533] Trial 1 finished with value: 0.31015399413744166 and parameters: {'learning_rate': 0.09460936319049677, 'n_estimators': 774}. Best is trial 0 with value: 0.30738036954574566.
[I 2024-03-19 22:46:15,825] Trial 2 finished with value: 0.307577481719661 and parameters: {'learning_rate': 0.07204047491631509, 'n_estimators': 1449}. Best is trial 0 with value: 0.30738036954574566.
[I 2024-03-19 22:46:56,794] Trial 3 finished with value: 0.3092449938464673 and parameters: {'learning_rate': 0.06923893229177526, 'n_estimators': 1261}. Best is trial 0 with value: 0.30738036954574566.
[I 2024-03-19 22:47:25,644] Trial 4 finished with value: 0.3092600644066167 and parameters: {'learning_rate': 0.09754983694887447, 'n_estimators': 857}. Best is trial 0 with value: 0.30738036954574566.
[I 2024-03-19 22:47:50,051] Trial 5 finished with value: 0.3191691204250052 and parameters: {'learning_rate': 0.0530247042495883, 'n_estimators': 665}. Best is trial 0 with value: 0.30738036954574566.
[I 2024-03-19 22:48:33,966] Trial 6 finished with value: 0.30931469706954123 and parameters: {'learning_rate': 0.06444077495302468, 'n_estimators': 1328}. Best is trial 0 with value: 0.30738036954574566.
[I 2024-03-19 22:49:26,250] Trial 7 finished with value: 0.3097184429918773 and parameters: {'learning_rate': 0.051063539675557296, 'n_estimators': 1553}. Best is trial 0 with value: 0.30738036954574566.
[I 2024-03-19 22:50:01,194] Trial 8 finished with value: 0.3249103086468331 and parameters: {'learning_rate': 0.026099593905584306, 'n_estimators': 900}. Best is trial 0 with value: 0.30738036954574566.
[I 2024-03-19 22:50:57,406] Trial 9 finished with value: 0.3151928740535336 and parameters: {'learning_rate': 0.0294326492052586, 'n_estimators': 1603}. Best is trial 0 with value: 0.30738036954574566.
[I 2024-03-19 22:52:09,694] Trial 10 finished with value: 0.32227640400023777 and parameters: {'learning_rate': 0.014564746218513225, 'n_estimators': 1973}. Best is trial 0 with value: 0.30738036954574566.
[I 2024-03-19 22:53:12,429] Trial 11 finished with value: 0.30630432455435286 and parameters: {'learning_rate': 0.0805902980585565, 'n_estimators': 1938}. Best is trial 11 with value: 0.30630432455435286.
[I 2024-03-19 22:54:17,010] Trial 12 finished with value: 0.3058130642860485 and parameters: {'learning_rate': 0.08048588227575734, 'n_estimators': 1998}. Best is trial 12 with value: 0.3058130642860485.
[I 2024-03-19 22:55:14,999] Trial 13 finished with value: 0.3062823091602316 and parameters: {'learning_rate': 0.08374472433204481, 'n_estimators': 1750}. Best is trial 12 with value: 0.3058130642860485.
[I 2024-03-19 22:56:14,012] Trial 14 finished with value: 0.306195857233181 and parameters: {'learning_rate': 0.08597443652607173, 'n_estimators': 1730}. Best is trial 12 with value: 0.3058130642860485.
[I 2024-03-19 22:57:12,271] Trial 15 finished with value: 0.30630024779280285 and parameters: {'learning_rate': 0.08288170579717015, 'n_estimators': 1729}. Best is trial 12 with value: 0.3058130642860485.
[I 2024-03-19 22:57:51,274] Trial 16 finished with value: 0.31709825247952195 and parameters: {'learning_rate': 0.03924650352231968, 'n_estimators': 1025}. Best is trial 12 with value: 0.3058130642860485.
[I 2024-03-19 22:59:42,742] Trial 17 finished with value: 0.3696725984291914 and parameters: {'learning_rate': 0.001513941838738729, 'n_estimators': 1755}. Best is trial 12 with value: 0.3058130642860485.
[I 2024-03-19 23:00:30,425] Trial 18 finished with value: 0.30697863645312873 and parameters: {'learning_rate': 0.08877811645345303, 'n_estimators': 1437}. Best is trial 12 with value: 0.3058130642860485.
[I 2024-03-19 23:01:09,064] Trial 19 finished with value: 0.309753023607684 and parameters: {'learning_rate': 0.07389433722377375, 'n_estimators': 1082}. Best is trial 12 with value: 0.3058130642860485.
[I 2024-03-19 23:02:11,566] Trial 20 finished with value: 0.3051665343626448 and parameters: {'learning_rate': 0.09816410797966438, 'n_estimators': 1831}. Best is trial 20 with value: 0.3051665343626448.
[I 2024-03-19 23:03:13,171] Trial 21 finished with value: 0.30573173424293654 and parameters: {'learning_rate': 0.09605043241004642, 'n_estimators': 1797}. Best is trial 20 with value: 0.3051665343626448.
[I 2024-03-19 23:04:16,130] Trial 22 finished with value: 0.30560561322170504 and parameters: {'learning_rate': 0.09908057813759082, 'n_estimators': 1859}. Best is trial 20 with value: 0.3051665343626448.
[I 2024-03-19 23:05:18,786] Trial 23 finished with value: 0.3063965449381953 and parameters: {'learning_rate': 0.0996817965426739, 'n_estimators': 1830}. Best is trial 20 with value: 0.3051665343626448.
[I 2024-03-19 23:06:12,928] Trial 24 finished with value: 0.30577930717232704 and parameters: {'learning_rate': 0.0925811010598444, 'n_estimators': 1595}. Best is trial 20 with value: 0.3051665343626448.
[I 2024-03-19 23:07:08,321] Trial 25 finished with value: 0.3060915718375415 and parameters: {'learning_rate': 0.09977579176829841, 'n_estimators': 1632}. Best is trial 20 with value: 0.3051665343626448.
[I 2024-03-19 23:08:12,869] Trial 26 finished with value: 0.3051846627192057 and parameters: {'learning_rate': 0.09038473485505345, 'n_estimators': 1871}. Best is trial 20 with value: 0.3051665343626448.
[I 2024-03-19 23:09:02,370] Trial 27 finished with value: 0.30722468026563676 and parameters: {'learning_rate': 0.0766276264869035, 'n_estimators': 1421}. Best is trial 20 with value: 0.3051665343626448.
[I 2024-03-19 23:10:09,002] Trial 28 finished with value: 0.3064995308658916 and parameters: {'learning_rate': 0.09012860503999219, 'n_estimators': 1893}. Best is trial 20 with value: 0.3051665343626448.
[I 2024-03-19 23:11:12,457] Trial 29 finished with value: 0.3062500412000902 and parameters: {'learning_rate': 0.06389813461055553, 'n_estimators': 1859}. Best is trial 20 with value: 0.3051665343626448.
{'learning_rate': 0.09816410797966438, 'n_estimators': 1831}

Results¶

The optimal set is a learning rate of 0.098 and 1,831 for the number of trees.

Let's now fit the model again (with the best parameters) to the whole training set.

In [17]:
study.best_params

xgb_model_best = xgb.XGBRegressor(objective = "reg:squarederror",
                                     verbosity = 0,
                                     importance_type = 'gain',
                                     **study.best_params
                                    )
xgb_model_best.fit(Xtr2, ytr2)
test_preds = xgb_model_best.predict(Xte2)
test_score = mean_squared_error(yte2, test_preds)
print(f"Test score: {test_score:.4f}")
print(f"RMSE:       {np.sqrt(test_score):.4f}")
Test score: 0.3003
RMSE:       0.5480

We have an RMSE value of 0.548. However, this RMSE is given in terms of the scaled target variable, so is not straightforward to interpret. Let's convert back to our original scale and recalculate the RMSE.

In [18]:
y_test_inverted = y_scaler.inverse_transform(yte2).flatten()
test_preds_inverted = y_scaler.inverse_transform(test_preds.reshape(-1, 1)).flatten()

test_df = pd.DataFrame({'y_test': y_test_inverted, 'test_preds': test_preds_inverted})
RMSE_inverted = np.sqrt(mean_squared_error(test_df['y_test'], test_df['test_preds']))
print(f"RMSE: {RMSE_inverted:.4f}")
RMSE: 0.6633

On our original scale, the RMSE is 0.663. This means that the average absolute difference between our predicted ratings and actual ratings is (roughly) 0.66, where ratings are on the 1-10 BGG scale.

Predictions¶

Next, let's plot predicted ratings against actual ratings.

In [48]:
ax = sns.scatterplot(data = test_df, x = 'y_test', y = 'test_preds', alpha = 0.6, marker = '.', edgecolor=None)
fig = ax.get_figure()

ax.set_xlabel('Actual rating')
ax.set_ylabel('Predicted rating')
ax.set_facecolor('white')
ax.grid(True)
ax.set_axisbelow(True)


# title
fig.text(0.5, 1, "Predicting BGG Boardgame Ratings", ha='center', weight='bold', size = 16)

# subtitle
fig.text(0.5, 0.94, "Predicted Ratings vs. Actual Ratings", ha='center', weight='normal', size = 16)

# caption
caption_text = """
Source: BoardGameGeek.com; Jerome Williams
Note: Predicted ratings from XGBoost model."""
fig.text(0.1, -0.10, caption_text, ha='left', size=9)

plt.savefig("output/01_predicted_vs_actual_XGB01.pdf", format="pdf", bbox_inches = 'tight')
No description has been provided for this image

Reassuringly, we see an upward-sloping between predicted and actual ratings. One interesting fact is that predicted ratings never go below 3.5, even though there are a handful of actual ratings in that range.

Feature importance¶

Now that we've fit our model, let's examine feature importance, the degree to which different features contribute to the predicted rating.

The feature importance for our XGBoost model corresponds to the average information gain) across all splits the feature is used in.

In [20]:
importance_df = pd.DataFrame({'feature' : features, 'importance' : xgb_model_best.feature_importances_})
importance_df.sort_values('importance', ascending = False, inplace = True)
In [49]:
ax = sns.barplot(data = importance_df.head(20), x = 'importance', y = 'feature')
fig = ax.get_figure()

ax.set_axisbelow(True)
ax.grid(True)

# title
fig.text(0.5, 1, "Predicting BGG Boardgame Ratings", ha='center', weight='bold', size = 16)

# subtitle
fig.text(0.5, 0.94, "Top 20 features by importance, XGBoost model", ha='center', weight='normal', size = 16)

# caption
caption_text = """Source: BoardGameGeek.com; Jerome Williams
Note: Feature importance is based on the gain metric, which measures the average improvement in accuracy 
across all splits the feature is used in."""

fig.text(-0.2, -0.15, caption_text, ha='left', size = 9)

ax.set_xlabel('Feature Importance')
ax.set_ylabel('Feature')
plt.show()
# plt.savefig("output/02_variable_importances_XGB01.pdf", format="pdf", bbox_inches = 'tight')
No description has been provided for this image