by Florent IDARGO
Context
Hello!
As one of the FLASS group members, I will share with you our first achievements and our plans for the next steps of
our project. We already defined our strategy when the blog was created, but as a quick reminder: our goal is to find
arbitrage opportunities on listed video game studios through the sentiment of the YouTube comments on their game trailers.
Sample Determination
As we have the ambitious goal of predicting the future trend of a stock price, or at least developing a reliable indicator,
we first needed a training and a test sample. Video game studios are very diverse in their structure and
diversification. Even though they are listed, they have not intensively benchmarked each other's management structures
and therefore maintain their own strong identities.
Faced with this complex, international market, we chose to build our data on the 25 best-selling games of 2018,
assuming they were representative of the listed developers' revenues and covered the main revenue sources.
To prevent corrupted data, we checked the linked game studios one by one to spot any exogenous event in the reveal
trailer's publication period which might have biased the data. This led us to eliminate a few games, which we replaced
with lower-ranking games fitting our requirements.
Tool decision
As 100% of our data comes from YouTube, trying to scrape the website through libraries such as Scrapy would quickly have been blocked by the platform, as the number of requests would be too high. Ways to bypass the rules a website sets in its robots.txt file do exist, but they look a bit shady, if not illegal, so we took another path: the YouTube API.
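For the curious, checking what a site's robots.txt actually allows takes only a few lines with Python's standard library; this is just an illustration, and the video ID below is a placeholder, not one of our trailers.

```python
# Minimal sketch: asking YouTube's robots.txt whether a generic
# crawler may fetch a given page (urllib.robotparser is part of
# the Python standard library).
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://www.youtube.com/robots.txt")
parser.read()  # downloads and parses the robots.txt file

# "*" stands for any crawler; the video ID is a placeholder.
url = "https://www.youtube.com/watch?v=VIDEO_ID"
print(parser.can_fetch("*", url))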
YouTube API potential
Going for the YouTube API was our best choice. The official documentation we began to scroll through was certainly messy, but the API already had a community of programmers, tutorials and external documentation. This was a precious asset, and it allowed us to quickly get to the hard work of collecting data to study the feasibility of our project.
Our need to get things done was well supported by Alphabet's infrastructure, which delivers credentials for its API service to every Google account, without any delay. By this time we had every tool needed (documentation to learn from and an API key to implement with); all that was left was to get the core job done: writing the software.
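To give an idea of what that looks like, here is a minimal sketch (not our exact code) of fetching a trailer's top-level comments with the official google-api-python-client package; the API key and video ID are placeholders.

```python
# Sketch: fetching one page of top-level comments for a trailer
# with the official google-api-python-client package.
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"   # issued for free to any Google account
VIDEO_ID = "VIDEO_ID"      # placeholder for a trailer's video ID

youtube = build("youtube", "v3", developerKey=API_KEY)

response = youtube.commentThreads().list(
    part="snippet",
    videoId=VIDEO_ID,
    textFormat="plainText",
    maxResults=100,        # the API maximum per request
).execute()

for thread in response["items"]:
    comment = thread["snippet"]["topLevelComment"]["snippet"]
    print(comment["textDisplay"])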
Software Development
Data download limit
As we iterated through our game sample to collect the comments data, we faced a first issue: the download limit.
YouTube only let us download 10,000 comments per day, and most of our selected trailers had about that many comments (note: the YouTube API only selects top-level comments, so it does not take the replies to a comment into account). This would have implied 25 days of data downloading, which is more time than we have. As a compromise, we all decided to each download the data from one game every day, which allows us to get most of our YouTube data five times faster.
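In practice the comments come in pages of at most 100, and the limit shows up as an HTTP 403 error once the daily quota is spent. Here is a sketch of the pagination loop, again with hypothetical names rather than our exact code:

```python
# Sketch: paginating through all top-level comments of a video,
# stopping gracefully when the daily quota runs out.
from googleapiclient.errors import HttpError

def download_comments(youtube, video_id):
    comments, page_token = [], None
    while True:
        try:
            response = youtube.commentThreads().list(
                part="snippet",
                videoId=video_id,
                textFormat="plainText",
                maxResults=100,
                pageToken=page_token,
            ).execute()
        except HttpError as err:
            # A 403 typically means the daily quota is exhausted:
            # keep what we have and resume tomorrow.
            print(f"Stopping early: {err}")
            break
        for thread in response["items"]:
            snippet = thread["snippet"]["topLevelComment"]["snippet"]
            comments.append(snippet["textDisplay"])
        page_token = response.get("nextPageToken")
        if page_token is None:  # last page reached
            break
    return comments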
Just below is a sample of the output we got on NBA 2K19; the handling of emojis will be treated in future posts.
Videos' Comments Heterogeneity
Our next issue was the heterogeneous quantity of comments under each reveal trailer. Most games have around 10,000 comments, one has 40,000, and a few are under 5,000…
We had two solutions: we could either weight each game's sentiment output in our model according to its number of comments, or take more videos from the games having fewer comments.
The second solution seemed better: as we are not yet statisticians, we could bias the result through computation mistakes and… As the adage says, "garbage in -> garbage out": we need comparable games, and if a game has too few comments it might just add noise to our analysis.
Thus, games with fewer than 5,000 comments won't be in our analysis and will be replaced by more commented trailers from 2018 games.
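To make the trade-off concrete, here is a toy comparison of the two options on made-up numbers; none of these figures are our real data.

```python
# Toy illustration of the two options on made-up numbers:
# game -> (comment_count, mean_sentiment in [-1, 1]).
games = {
    "game_a": (40_000, 0.32),
    "game_b": (10_000, 0.10),
    "game_c": (3_000, -0.45),   # noisy: too few comments
}

# Option 1: weight each game's sentiment by its comment count.
total = sum(count for count, _ in games.values())
weighted = sum(count * score for count, score in games.values()) / total

# Option 2 (ours): drop games below 5,000 comments entirely.
kept = {name: v for name, v in games.items() if v[0] >= 5_000}

print(f"weighted sentiment: {weighted:.3f}")
print(f"games kept: {list(kept)}")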
Coming work
Following our previous work, we still need to complete our data with a few more games compatible with our prerequisites… We expect to finish this work by the end of the week.
We will cover the text pre-processing and exploratory data analysis in our next blog post.
Our last post will be about mood learning and portfolio management.
Stay in touch with us!