Predicting global game sales
The project has two goals: (a) to demonstrate how to assemble tidymodels code for a regression analysis, and (b) to evaluate how well the resulting model predicts outcomes. The ultimate aim of the study is to investigate whether an accurate machine learning model can be built to forecast video game unit sales from the features in this dataset. This question is investigated with several supervised ML models.
The data comes from Kaggle. Motivated by Gregory Smith's web scrape of VGChartz video game sales, this dataset extends the number of variables with an additional web scrape from Metacritic.
In outline, this analysis includes:
- Data manipulation
- Exploratory data analysis (EDA)
- Tidy-style model and workflow creation
- Tuning
- Identifying predictor importance
- Model validation
The dataset was released on Kaggle in 2016. It is rather old, but no newer dataset with comparable features is available. For instance, this one, released a few months ago, could serve as an alternative, but its features are not sufficient to build a predictive model.
# data from local file
game <- read.csv("data/Video_Games_Sales_as_at_22_Dec_2016.csv", header = TRUE)
# libraries
libs <- c("tidyverse", "tidymodels", "lubridate", "kableExtra", "vip") # avoid shadowing library()
sapply(libs, require, character.only = TRUE)
## tidyverse tidymodels lubridate kableExtra vip
## TRUE TRUE TRUE TRUE TRUE
glimpse(game)
## Rows: 16,719
## Columns: 16
## $ Name <chr> "Wii Sports", "Super Mario Bros.", "Mario Kart Wii", "…
## $ Platform <chr> "Wii", "NES", "Wii", "Wii", "GB", "GB", "DS", "Wii", "…
## $ Year_of_Release <chr> "2006", "1985", "2008", "2009", "1996", "1989", "2006"…
## $ Genre <chr> "Sports", "Platform", "Racing", "Sports", "Role-Playin…
## $ Publisher <chr> "Nintendo", "Nintendo", "Nintendo", "Nintendo", "Ninte…
## $ NA_Sales <dbl> 41.36, 29.08, 15.68, 15.61, 11.27, 23.20, 11.28, 13.96…
## $ EU_Sales <dbl> 28.96, 3.58, 12.76, 10.93, 8.89, 2.26, 9.14, 9.18, 6.9…
## $ JP_Sales <dbl> 3.77, 6.81, 3.79, 3.28, 10.22, 4.22, 6.50, 2.93, 4.70,…
## $ Other_Sales <dbl> 8.45, 0.77, 3.29, 2.95, 1.00, 0.58, 2.88, 2.84, 2.24, …
## $ Global_Sales <dbl> 82.53, 40.24, 35.52, 32.77, 31.37, 30.26, 29.80, 28.92…
## $ Critic_Score <int> 76, NA, 82, 80, NA, NA, 89, 58, 87, NA, NA, 91, NA, 80…
## $ Critic_Count <int> 51, NA, 73, 73, NA, NA, 65, 41, 80, NA, NA, 64, NA, 63…
## $ User_Score <chr> "8", "", "8.3", "8", "", "", "8.5", "6.6", "8.4", "", …
## $ User_Count <int> 322, NA, 709, 192, NA, NA, 431, 129, 594, NA, NA, 464,…
## $ Developer <chr> "Nintendo", "", "Nintendo", "Nintendo", "", "", "Ninte…
## $ Rating <chr> "E", "", "E", "E", "", "", "E", "E", "E", "", "", "E",…
skimr::skim(game)
Name | game |
Number of rows | 16719 |
Number of columns | 16 |
_______________________ | |
Column type frequency: | |
character | 8 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Name | 0 | 1 | 0 | 132 | 2 | 11563 | 0 |
Platform | 0 | 1 | 2 | 4 | 0 | 31 | 0 |
Year_of_Release | 0 | 1 | 3 | 4 | 0 | 40 | 0 |
Genre | 0 | 1 | 0 | 12 | 2 | 13 | 0 |
Publisher | 0 | 1 | 3 | 38 | 0 | 582 | 0 |
User_Score | 0 | 1 | 0 | 3 | 6704 | 97 | 0 |
Developer | 0 | 1 | 0 | 80 | 6623 | 1697 | 0 |
Rating | 0 | 1 | 0 | 4 | 6769 | 9 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
NA_Sales | 0 | 1.00 | 0.26 | 0.81 | 0.00 | 0.00 | 0.08 | 0.24 | 41.36 | ▇▁▁▁▁ |
EU_Sales | 0 | 1.00 | 0.15 | 0.50 | 0.00 | 0.00 | 0.02 | 0.11 | 28.96 | ▇▁▁▁▁ |
JP_Sales | 0 | 1.00 | 0.08 | 0.31 | 0.00 | 0.00 | 0.00 | 0.04 | 10.22 | ▇▁▁▁▁ |
Other_Sales | 0 | 1.00 | 0.05 | 0.19 | 0.00 | 0.00 | 0.01 | 0.03 | 10.57 | ▇▁▁▁▁ |
Global_Sales | 0 | 1.00 | 0.53 | 1.55 | 0.01 | 0.06 | 0.17 | 0.47 | 82.53 | ▇▁▁▁▁ |
Critic_Score | 8582 | 0.49 | 68.97 | 13.94 | 13.00 | 60.00 | 71.00 | 79.00 | 98.00 | ▁▁▅▇▃ |
Critic_Count | 8582 | 0.49 | 26.36 | 18.98 | 3.00 | 12.00 | 21.00 | 36.00 | 113.00 | ▇▃▂▁▁ |
User_Count | 9129 | 0.45 | 162.23 | 561.28 | 4.00 | 10.00 | 24.00 | 81.00 | 10665.00 | ▇▁▁▁▁ |
The data has 16 variables: 8 numeric and 8 in character format. No values are recorded as missing in the character columns (though several contain many empty strings), while three of the numeric variables suffer from a huge amount of missing values.
Since this is a small demonstration, I will simply remove observations with missing values. In a real-life situation, however, it would be worth analyzing what type of observations are missing: are they missing at random or systematically? Preserving those observations may require additional computation. For some parts of the EDA and the graphs, the book by Xijin Ge, Jianli Qi, and Rong Fan is very helpful; they demonstrate many examples that facilitate learning R programming.
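As a quick first check of whether the missingness is random or systematic, one could compare missingness rates across groups before dropping any rows. A minimal sketch, run before the `na.omit()` call below (the choice of grouping by `Platform` is illustrative):

```r
# Share of missing Critic_Score per platform: large differences between
# platforms would suggest the missingness is systematic, not random
game %>%
  group_by(Platform) %>%
  summarize(n = n(),
            missing_critic = mean(is.na(Critic_Score))) %>%
  arrange(desc(missing_critic))
```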
We can start with the numeric variables. To match the units of the critic scores with the user scores, I divided them by 10, and then converted the other variables to numeric types.
game <- na.omit(game) # remove rows with NA
game <- game %>%
mutate(Critic_Score = as.numeric(as.character(Critic_Score)) / 10, # same scale as User_Score now
User_Score = as.numeric(as.character(User_Score)), # stored as character in the data
Critic_Count = as.numeric(Critic_Count),
User_Count = as.numeric(User_Count))
Here you can see the distribution of global sales by year, the number of games released per year, and which publisher had the highest sales in each year.
# global sales
game %>%
group_by(Year_of_Release) %>%
summarize(Sales = sum(Global_Sales, na.rm = TRUE)) %>%
ggplot(aes(x = Year_of_Release, y = Sales)) +
geom_col(fill = "navyblue") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Global Sales per Year", x = "Year", y = "Sales (units)")
# Game release years
game %>%
group_by(Year_of_Release) %>%
summarize(Number_of_Games = n()) %>%
ggplot(aes(x = Year_of_Release, y = Number_of_Games)) +
geom_col(fill = "magenta4") +
theme_bw()+
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Games Released per Year", x = "Year", y = "Number of games")
# Top Global Sales each year
game %>%
group_by(Year_of_Release, Publisher) %>%
summarize(Sales = sum(Global_Sales)) %>%
top_n(n = 1) %>%
kable() %>% kable_styling()
## `summarise()` has grouped output by 'Year_of_Release'. You can override using the `.groups` argument.
## Selecting by Sales
Year_of_Release | Publisher | Sales |
---|---|---|
1985 | Activision | 0.03 |
1988 | Maxis | 0.03 |
1992 | id Software | 0.03 |
1994 | Sony Computer Entertainment | 1.27 |
1996 | Sony Computer Entertainment | 7.66 |
1997 | Sony Computer Entertainment | 22.79 |
1998 | Sony Computer Entertainment | 15.67 |
1999 | Sony Computer Entertainment | 17.84 |
2000 | Electronic Arts | 16.80 |
2001 | Sony Computer Entertainment | 40.04 |
2002 | Electronic Arts | 66.72 |
2003 | Electronic Arts | 67.86 |
2004 | Electronic Arts | 62.30 |
2005 | Nintendo | 94.52 |
2006 | Nintendo | 177.06 |
2007 | Nintendo | 92.49 |
2008 | Nintendo | 83.29 |
2009 | Nintendo | 103.35 |
2010 | Electronic Arts | 70.86 |
2011 | Electronic Arts | 54.25 |
2012 | Activision | 47.44 |
2013 | Take-Two Interactive | 51.34 |
2014 | Ubisoft | 31.49 |
2015 | Electronic Arts | 23.54 |
2016 | Electronic Arts | 25.05 |
N/A | Electronic Arts | 16.91 |
The release year of each game is available in the data, so we can create a new variable for the age of a game.
# create a new variable for the age of a game
game <- game %>%
mutate(Age = 2018 - as.numeric(Year_of_Release))
The distributions of user and critic counts are heavily skewed, which leads me to take their logs. Moreover, global sales are not recorded in true units, so I multiply them by 1,000,000. This variable is also heavily skewed, but I will take its log in the modeling stage.
# log critic and user counts since they are skewed
game <- game %>%
mutate(Global_Sales = Global_Sales * 1000000) %>% # actual sales
mutate(Critic.Count.Log = log(Critic_Count),
User.Count.Log = log(User_Count)
)
Finally, we can create a histogram and Q-Q plot for each numeric variable, look at its distribution, and examine the shapiro.test results.
# brings all numeric columns' name
chrs <- sapply(game, is.numeric)
name <- names(game[, chrs])
par(mfrow = c(4, 4)) # Layout outputs in 4 rows and 4 columns
for (i in 1:length(name)){
sub <- sample(game[name[i]][, 1], 5000)
submean <- mean(sub)
hist(sub, main = paste("Hist. of", name[i], sep = " "), xlab = name[i])
abline(v = submean, col = "blue", lwd = 1)
qqnorm(sub, main = paste("Q-Q Plot of", name[i], sep = " "))
qqline(sub)
if (i == 1) {s.t <- shapiro.test(sub)
} else {s.t <- rbind(s.t, shapiro.test(sub))
}
}
s.t <- s.t[, 1:2] # Take first two columns of shapiro.test result
s.t <- cbind(name, s.t) # Add variable name for the result
s.t
## name statistic p.value
## s.t "NA_Sales" 0.301268 1.494323e-87
## "EU_Sales" 0.3588133 1.280918e-85
## "JP_Sales" 0.2212251 5.054568e-90
## "Other_Sales" 0.2293389 8.786447e-90
## "Global_Sales" 0.3649317 2.098002e-85
## "Critic_Score" 0.9625661 4.109953e-34
## "Critic_Count" 0.9187605 8.872664e-46
## "User_Score" 0.9135635 8.65544e-47
## "User_Count" 0.2851841 4.559314e-88
## "Age" 0.9800762 1.020473e-25
## "Critic.Count.Log" 0.9774761 1.968921e-27
## "User.Count.Log" 0.9425133 2.375158e-40
First of all, let's check out the distribution of games across platforms. I also combine sub-platforms with their main families. As seen, most games were sold on the PlayStation platform, followed by Nintendo and Xbox. The results show that people tend to play games mainly on these three platforms.
################ Platforms
# Regroup platform as Platform.type
pc <- c("PC")
xbox <- c("X360", "XB", "XOne")
nintendo <- c("Wii", "WiiU", "N64", "GC", "NES", "3DS", "DS")
playstation <- c("PS", "PS2", "PS3", "PS4", "PSP", "PSV")
game <- game %>%
mutate(Platform.type = case_when(Platform %in% pc ~ "PC",
Platform %in% xbox ~ "Xbox",
Platform %in% nintendo ~ "Nintendo",
Platform %in% playstation ~ "Playstation",
TRUE ~ "Others"))
# plot
ggplot(game, aes(x = Platform.type)) + geom_bar(fill = "blue") +
theme_minimal()
I follow a similar approach for ratings. First, I drop observations with an empty rating, and then I merge rare ratings into the most relevant levels. Finally, you can see their distribution in a mosaic plot.
game <- game %>% filter(Rating != "") %>% droplevels() #remove empty rating observations
table(game$Rating)
##
## AO E E10+ K-A M RP T
## 1 2118 946 1 1459 2 2420
# collapse rare Ratings
game <- game %>% mutate(Rating = ifelse(Rating == "AO", "M", Rating),
Rating = ifelse(Rating == "K-A", "E", Rating),
Rating = ifelse(Rating == "RP", "E", Rating))
table(game$Rating)
##
## E E10+ M T
## 2121 946 1460 2420
library(ggmosaic)
library(plotly)
p <- ggplot(game) +
geom_mosaic(aes(x = product(Rating), fill = Platform.type), na.rm = TRUE) +
labs(x = "Rating Type", y = "Platform Type", title="Mosaic Plot") +
theme(axis.text.y = element_blank())
ggplotly(p)
(Interactive mosaic plot of Rating by Platform.type rendered here in the HTML version.)
We can also look at the distribution of genres in the ring plot.
dat <- data.frame(table(game$Genre))
dat$fraction <- dat$Freq / sum(dat$Freq)
dat <- dat[order(dat$fraction), ]
dat$ymax <- cumsum(dat$fraction)
dat$ymin <- c(0, head(dat$ymax, n = -1))
names(dat)[1] <- "Genre"
#Plot
ggplot(dat, aes(fill = Genre, ymax = ymax, ymin = ymin, xmax = 4, xmin = 3)) +
geom_rect(colour = "grey30") + # Background color
coord_polar(theta = "y") + # Coordinate system to polar
xlim(c(0, 4)) +
theme_bw()+
labs(title = "Ring plot for Genre", fill = "Genre") +
theme(plot.title = element_text(hjust = 0.5))
Before selecting our features for model building, it is important to know whether there are high correlations between our continuous variables.
# Take numeric variables as goal matrix
st <- game[, sapply(game, class) %in% c('numeric')]
st <- na.omit(st)
library(ellipse)
library(corrplot)
corMatrix <- cor(as.matrix(st)) # Correlation matrix
col <- colorRampPalette(c("#7F0000", "red", "#FF7F00", "yellow", "#7FFF7F",
"cyan", "#007FFF", "blue", "#00007F"))
corrplot.mixed(corMatrix, order = "AOE", lower = "number", lower.col = "black",
number.cex = .8, upper = "ellipse", upper.col = col(10),
diag = "u", tl.pos = "lt", tl.col = "black")
plot(hclust(as.dist(1 - cor(as.matrix(st))))) # Hierarchical clustering
We could build a predictive model for either regression or classification. If we wanted to follow a classification approach, we could convert global sales into a binary variable, such as sales above versus below 1,000,000 units. Nevertheless, I will demonstrate a regression model here, since the main features are also continuous variables.
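As a sketch of that classification alternative (not used in the rest of this analysis; the names `game_class` and `Hit` are my own):

```r
# Hypothetical binary target: did the game sell more than 1,000,000 units?
# Global_Sales is already in true units here (multiplied by 1e6 earlier)
game_class <- game %>%
  mutate(Hit = factor(ifelse(Global_Sales > 1e6, "yes", "no")))
table(game_class$Hit)
```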
I select three categorical variables; the rest are numeric variables.
# final data
final_data <- game %>%
select( Age, Genre, Global_Sales,Critic_Score, Critic.Count.Log,
User_Score, User.Count.Log, Rating, Platform.type)
glimpse(final_data)
## Rows: 6,947
## Columns: 9
## $ Age <dbl> 12, 10, 9, 12, 12, 9, 13, 11, 8, 9, 5, 14, 13, 5, 16,…
## $ Genre <chr> "Sports", "Racing", "Sports", "Platform", "Misc", "Pl…
## $ Global_Sales <dbl> 82530000, 35520000, 32770000, 29800000, 28920000, 283…
## $ Critic_Score <dbl> 7.6, 8.2, 8.0, 8.9, 5.8, 8.7, 9.1, 8.0, 6.1, 8.0, 9.7…
## $ Critic.Count.Log <dbl> 3.931826, 4.290459, 4.290459, 4.174387, 3.713572, 4.3…
## $ User_Score <dbl> 8.0, 8.3, 8.0, 8.5, 6.6, 8.4, 8.6, 7.7, 6.3, 7.4, 8.2…
## $ User.Count.Log <dbl> 5.774552, 6.563856, 5.257495, 6.066108, 4.859812, 6.3…
## $ Rating <chr> "E", "E", "E", "E", "E", "E", "E", "E", "E", "E", "M"…
## $ Platform.type <chr> "Nintendo", "Nintendo", "Nintendo", "Nintendo", "Nint…
tidymodels is my favorite modeling framework, and the whole process below is based mainly on tidymodels functions. Here the data is split 0.75 for training vs. 0.25 for testing.
set.seed(1234)
data_split <- initial_split(final_data,
prop = 0.75,
strata = Global_Sales) # outcome
# Create training data
train_df <- data_split %>% training()
test_df <- data_split %>% testing()
dim(train_df); dim(test_df)
## [1] 5208 9
## [1] 1739 9
I use a very basic recipe here. It converts all character columns into dummy variables, imputes the missing values in Age (122 NAs), normalizes the numeric predictors, and finally takes the log of the outcome variable, since it is also skewed.
recete <- recipe(Global_Sales ~ ., train_df) %>%
# dummy encoding of categorical variables
step_dummy(all_nominal()) %>%
# impute missing Age values (step_knnimpute is deprecated in newer
# recipes versions in favor of step_impute_knn)
step_knnimpute(Age) %>%
# center and scale the numeric predictors
step_normalize(all_numeric(), -all_outcomes()) %>%
# log-transform the skewed outcome variable
step_log(Global_Sales)
recipe_prepped <- prep(recete)
df_train <- bake(recipe_prepped, new_data = train_df)
dim(df_train)
## [1] 5208 24
df_test <- recipe_prepped %>%
bake(test_df)
dim(df_test)
## [1] 1739 24
Since we do not have that many variables, I did not include a recipe step to remove highly correlated variables directly. For that reason, it is wise to check whether there are high correlations among them.
library(corrr)
corr_df <- df_train %>% select(where(is.numeric)) %>%
correlate() %>%
rearrange() %>%
shave()
rplot(corr_df) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90)) +
scale_colour_viridis_c()
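If you did want the recipe itself to drop highly correlated predictors, recipes provides `step_corr`; a minimal sketch (the 0.9 threshold and the name `recete_filtered` are illustrative choices, not part of the analysis above):

```r
# Alternative recipe that filters out numeric predictors whose pairwise
# correlation with another predictor exceeds 0.9
recete_filtered <- recipe(Global_Sales ~ ., train_df) %>%
  step_dummy(all_nominal()) %>%
  step_corr(all_numeric(), -all_outcomes(), threshold = 0.9)
```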
To get a more reliable estimate of model performance, I will use a 10-fold cross-validation procedure. I also use parallel processing to speed up the analysis.
########### validation folds
k_folds <- vfold_cv(train_df)
############### do paralel
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
Here I will use four models. After checking their results, I will pick the best model for tuning.
lm_model <- linear_reg() %>%
set_engine("lm")
lm_wf <-
workflow() %>%
add_recipe(recete) %>%
add_model(lm_model)
set.seed(123)
lm_fit <-
lm_wf %>%
fit_resamples(resamples =k_folds)
lm_fit %>%
collect_metrics()
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 0.980 10 0.0131 Preprocessor1_Model1
## 2 rsq standard 0.509 10 0.0125 Preprocessor1_Model1
set.seed(123)
lm_last_fit <-
lm_wf %>%
last_fit(split =data_split)
results_lm <- lm_last_fit %>%
collect_metrics()
lm_last_fit %>%
collect_predictions() %>%
ggplot(aes(.pred, Global_Sales)) +
geom_abline(col = "green", lty = 2) +
geom_point(alpha = .4, colour = "midnightblue") +
annotate("text", x = Inf, y = Inf, hjust = 1.5, vjust = 2.5,
label = paste("RMSE: ", round(results_lm$.estimate[1], 4))) +
labs(title = "Linear regression model",
subtitle = "Variety of steps in recipe, lm engine")
glm_model <-
linear_reg(penalty = 0.001, mixture = 0.5) %>%
set_engine("glmnet")
glm_wf <-
workflow() %>%
add_recipe(recete) %>%
add_model(glm_model)
set.seed(123)
glm_fit <-
glm_wf %>%
fit_resamples(resamples =k_folds)
glm_fit %>%
collect_metrics()
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 0.980 10 0.0131 Preprocessor1_Model1
## 2 rsq standard 0.509 10 0.0125 Preprocessor1_Model1
set.seed(123)
glm_last_fit <-
glm_wf %>%
last_fit(split =data_split)
results_glm <- glm_last_fit %>%
collect_metrics()
glm_last_fit %>%
collect_predictions() %>%
ggplot(aes(.pred, Global_Sales)) +
geom_abline(col = "green", lty = 2) +
geom_point(alpha = .4, colour = "midnightblue") +
annotate("text", x = Inf, y = Inf, hjust = 1.5, vjust = 2.5,
label = paste("RMSE: ", round(results_glm$.estimate[1], 4))) +
labs(title = "Elastic net regression model",
subtitle = "Variety of steps in recipe, glmnet engine")
# model
rf_model <-
rand_forest() %>%
set_engine("ranger") %>%
set_mode("regression")
# workflow
rf_wf <-
workflow() %>%
add_recipe(recete) %>%
add_model(rf_model)
summary(rf_wf)
## Length Class Mode
## pre 2 stage_pre list
## fit 2 stage_fit list
## post 1 stage_post list
## trained 1 -none- logical
# model fit
set.seed(123)
rf_fit <-
rf_wf %>%
fit_resamples(resamples = k_folds)
# metrics
rf_fit %>%
collect_metrics()
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 0.931 10 0.0127 Preprocessor1_Model1
## 2 rsq standard 0.570 10 0.0131 Preprocessor1_Model1
# last fit
set.seed(123)
rf_last_fit <-
rf_wf %>%
last_fit(split = data_split)
# metrics
rf_results <- rf_last_fit %>%
collect_metrics()
# plot
rf_last_fit %>%
collect_predictions() %>%
ggplot(aes(.pred, Global_Sales)) +
geom_abline(col = "green", lty = 2) +
geom_point(alpha = .4, colour = "midnightblue") +
annotate("text", x = Inf, y = Inf, hjust = 1.5, vjust = 2,
label = paste("RMSE: ", round(rf_results$.estimate[1], 4))) +
labs(title = "Random forest model",
subtitle = "No hyperparameters tuned, ranger engine")
xgb_model <-
boost_tree(
mode = "regression",
trees = 500,
learn_rate = 0.1,
min_n = 11,
tree_depth = 8
) %>%
set_engine(
"xgboost",
booster = "gbtree"
)
# workflow
xgb_wf <-
workflow() %>%
add_recipe(recete) %>%
add_model(xgb_model)
summary(xgb_wf)
## Length Class Mode
## pre 2 stage_pre list
## fit 2 stage_fit list
## post 1 stage_post list
## trained 1 -none- logical
# model fit
set.seed(123)
xgb_fit <-
xgb_wf %>%
fit_resamples(resamples = k_folds)
# metrics
xgb_fit %>%
collect_metrics()
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 0.941 10 0.0133 Preprocessor1_Model1
## 2 rsq standard 0.554 10 0.0127 Preprocessor1_Model1
# last fit
set.seed(123)
xgb_last_fit <-
xgb_wf %>%
last_fit(split = data_split)
# metrics
xgb_results <- xgb_last_fit %>%
collect_metrics()
# plot
xgb_last_fit %>%
collect_predictions() %>%
ggplot(aes(.pred, Global_Sales)) +
geom_abline(col = "green", lty = 2) +
geom_point(alpha = .4, colour = "midnightblue") +
annotate("text", x = Inf, y = Inf, hjust = 1.5, vjust = 2,
label = paste("RMSE: ", round(xgb_results$.estimate[1], 4))) +
labs(title = "XGBoost model",
subtitle = "Manually set hyperparameters, xgboost engine")
# comparing rmse and rsq
lm_score <- lm_fit %>%
collect_metrics() %>%
mutate(model = "lm")
glm_score <- glm_fit %>%
collect_metrics() %>%
mutate(model = "glm")
rf_score <- rf_fit %>%
collect_metrics() %>%
mutate(model = "rf")
xgb_score <- xgb_fit %>%
collect_metrics() %>%
mutate(model = "xgb")
score_table <- bind_rows(lm_score, glm_score, rf_score, xgb_score) %>%
select(-.config)
score_table %>%
filter(.metric == "rmse") %>%
ggplot(aes(x = mean, y = reorder(model, mean))) +
geom_col() +
theme_bw() +
labs(y = "Models",
x = "RMSE means")
# graph
# Extract predictions from our models to compare them:
lm_metrics <-
lm_last_fit %>%
collect_predictions() %>%
rename("lm" = .pred) %>%
select("lm")
glm_metrics <-
glm_last_fit %>%
collect_predictions() %>%
rename("glmnet" = .pred) %>%
select("glmnet")
rf_metrics <-
rf_last_fit %>%
collect_predictions() %>%
rename("rf" = .pred) %>%
select("rf")
xgb_metrics <-
xgb_last_fit %>%
collect_predictions() %>%
rename("xgb" = .pred) %>%
select("xgb")
model_compare <-
rf_last_fit %>%
collect_predictions() %>%
bind_cols(
glm_metrics,
rf_metrics,
lm_metrics,
xgb_metrics
) %>%
select(-id, -.pred, -.row)
model_compare %>% pivot_longer(
cols = c(glmnet, rf, lm, xgb),
names_to = "model",
values_to = "prediction") %>%
select(-.config)
## # A tibble: 6,956 × 3
## Global_Sales model prediction
## <dbl> <chr> <dbl>
## 1 17.2 glmnet 14.9
## 2 17.2 rf 15.1
## 3 17.2 lm 14.9
## 4 17.2 xgb 15.8
## 5 16.8 glmnet 14.2
## 6 16.8 rf 13.6
## 7 16.8 lm 14.2
## 8 16.8 xgb 13.6
## 9 16.6 glmnet 15.2
## 10 16.6 rf 15.2
## # … with 6,946 more rows
model_compare %>%
pivot_longer(
cols = c(glmnet, lm, rf, xgb),
names_to = "model",
values_to = "prediction") %>%
select(-.config) %>%
ggplot(aes(prediction, Global_Sales)) +
geom_abline(col = "green", lty = 2) +
geom_point(col = "midnightblue", alpha = .7) +
facet_wrap(~model) +
theme_bw() +
coord_fixed()
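The facet plot can be complemented with per-model test-set metrics. This is a short sketch assuming the model_compare tibble built above; yardstick's rmse() respects group_by(), so it returns one RMSE per model:

```r
# Per-model test-set RMSE from the combined prediction table
model_compare %>%
  pivot_longer(
    cols = c(glmnet, lm, rf, xgb),
    names_to = "model",
    values_to = "prediction") %>%
  group_by(model) %>%
  rmse(truth = Global_Sales, estimate = prediction)
```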
Our results show that the random forest model performs best, with the lowest RMSE and the highest R-squared among the four models.
We can tune the RF model to improve it further.
set.seed(256)
rf_mod <-
rand_forest(trees = 500,
mtry = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity",
num.threads = 12) %>%
set_mode("regression")
#Establish Model Flow
tune_wf <- workflow() %>%
add_recipe(recete) %>%
add_model(rf_mod)
#Generate grid to perform grid search for hyper parameter optimization
rf_grid <- grid_regular(mtry(range = c(6,10)),
min_n(range = c(14,20)),
levels = c(10,9))
# 10-fold Cross Validation Stratified by Global_Sales
folds <- vfold_cv(train_df, v = 10, strata = Global_Sales)
#Train and evaluate all combinations of hyperparameters
# specified in rf_grid
doParallel::registerDoParallel(cores = 12)
rf_grid_search <- tune_grid(
tune_wf,
resamples = folds,
grid = rf_grid)
rf_grid_search %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
select(mean, min_n, mtry) %>%
filter(mtry > 4) %>%
ggplot(aes(min_n, mean, color = as_factor(mtry))) +
geom_point()+
geom_line()+
scale_color_viridis_d() +
theme_minimal()+
scale_x_continuous(breaks = pretty_breaks())+
theme(legend.position = "bottom") +
labs(x = "Minimum Number of Observations to Split Node",
y = "RMSE",
title = "Grid Search Results for Random Forest",
color = "Number of Predictors Sampled at Each Split")
# show best
rf_grid_search %>% show_best(metric = "rmse")
## # A tibble: 5 × 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10 14 rmse standard 0.917 10 0.00864 Preprocessor1_Model05
## 2 10 16 rmse standard 0.917 10 0.00852 Preprocessor1_Model15
## 3 8 14 rmse standard 0.917 10 0.00850 Preprocessor1_Model03
## 4 9 14 rmse standard 0.917 10 0.00862 Preprocessor1_Model04
## 5 10 15 rmse standard 0.917 10 0.00866 Preprocessor1_Model10
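Since the top five candidates are within a fraction of a standard error of each other, an alternative to select_best() is tune's select_by_one_std_err(), which picks the simplest model whose performance is within one standard error of the numerically best one. A sketch assuming the rf_grid_search object above (the sorting here treats smaller mtry and larger min_n as "simpler"):

```r
# Simplest candidate within one standard error of the best RMSE
rf_parsimonious <- select_by_one_std_err(
  rf_grid_search,
  mtry, desc(min_n),   # sort so the first row is the most parsimonious model
  metric = "rmse"
)
```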
rf_best_rmse <- select_best(rf_grid_search, metric = "rmse")
final_rf <- finalize_model(
rf_mod,
rf_best_rmse
)
final_rf
## Random Forest Model Specification (regression)
##
## Main Arguments:
## mtry = 10
## trees = 500
## min_n = 14
##
## Engine-Specific Arguments:
## importance = impurity
## num.threads = 12
##
## Computational engine: ranger
final_rf %>%
fit(Global_Sales ~ ., data = bake(prep(recete), training(data_split))) %>%
vip(geom = "col", num_features = 10) +
theme_minimal()
In the tuning process, I was able to reduce the cross-validated RMSE from 0.931 to 0.917 while keeping the R-squared about the same.
final_wf <-
workflow() %>%
add_recipe(recete) %>%
add_model(final_rf)
overallfit <- final_wf %>%
tune::last_fit(data_split)
overallfit %>%
collect_metrics()
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.913 Preprocessor1_Model1
## 2 rsq standard 0.570 Preprocessor1_Model1
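If the finalized model will be used on new data, the fitted workflow can be pulled out of the last_fit object. A short sketch assuming the overallfit object above (the new-data name is hypothetical):

```r
# Extract the trained workflow (recipe + tuned RF) from the last_fit result
final_fitted <- extract_workflow(overallfit)

# Hypothetical scoring of new releases: new_games is an assumed data frame
# with the same columns as the training data
# predict(final_fitted, new_data = new_games)
```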
overallfit %>%
collect_predictions() %>%
ggplot(aes(x = Global_Sales, y = .pred)) +
geom_abline(col = "green", lty = 2, lwd = 2) +
geom_point(alpha = .4, colour = "midnightblue") +
theme_minimal() +
labs(x = "Game Sales",
y = "Predicted Game Sales",
title = "Tuned Random Forest Regression")
- Our results show that user count is the most important variable in predicting global sales, followed by critic count. The relationship is intuitive: the more copies a game sells, the more user ratings and critic reviews it accumulates.
- The PC platform is also an important predictor, perhaps because PC users find it easier to post reviews and ratings online. The next most informative platform is the PlayStation, which also accounts for a large share of users.
- The age of a game matters as well: if people enjoy a game, its popularity and reputation grow over time, which leads to more sales.
- Among genres, sports games contribute the most to global sales, and the T rating is the rating category that explains the most variation.
- Finally, the model predicts global game sales reasonably well with the features available in this data set. It could be improved with further work, particularly on outliers. Likewise, imputing missing values to retain more observations is another way to improve the model.
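As a sketch of the imputation idea mentioned above (assuming the same train_df and predictor set used in recete; the recipe name here is illustrative), the recipes package offers imputation steps that would retain rows currently lost to missing scores:

```r
library(tidymodels)

# Hypothetical variant of the recipe: impute missing values instead of dropping them
recete_imputed <-
  recipe(Global_Sales ~ ., data = train_df) %>%
  step_impute_median(all_numeric_predictors()) %>%  # fill numeric NAs with column medians
  step_impute_mode(all_nominal_predictors())        # fill categorical NAs with the mode
```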