Citation: Purwandari, K., Sigalingging, J. W., Cenggoro, T. W., & Pardamean, B. (2021). Multi-class weather forecasting from Twitter using machine learning approaches. Procedia Computer Science, 47-54.
Key Idea
The researchers’ goal in this study was to classify intra-day weather at an hourly timescale. By using data mining techniques to extract keywords from the social media platform Twitter, combined with machine learning (ML) and text recognition methods, they developed a multi-class weather classification system for Indonesia. Several keywords were used to classify weather conditions throughout the country, such as sunny, cloudy, and thunderstorms. Three ML methods were compared for their effectiveness in meeting this goal: support vector machine (SVM), multinomial Naive Bayes (MNB), and logistic regression (LR).
Background and Top Technical Innovations
In Indonesia, rainfall plays a big role in people’s way of life, and the efficient use of rainwater is critical to sustaining and growing the local economy. Farmers are one example: their crops flourish only if they can predict the best time to plant and harvest. Indonesia has only two seasons, the rainy season and the dry season, and rain drives daily choices ranging from when to begin planting crops to the best time of day to run errands.
According to the researchers, traditional intra-day forecasting is unreliable, particularly at the hourly timescale, where climate models fail to reproduce sub-daily precipitation dependably. The researchers sought to fill this gap with modern tools, using information provided by the very people they seek to serve. The result is a closed information loop: gathering “tweets” from the locals, processing them with ML algorithms, and disseminating a weather prediction back to the locals via a media outlet such as Twitter (a future task).
The researchers followed a five-step approach to their innovation, shown in Figure 1:
Data was collected from Twitter via an application programming interface (API) using the Twitter package in the R language for statistical computing. To extract only Indonesian tweets, the language code ‘id’ was used, which is the signature of an Indonesian tweet. The keywords sought in the tweet text were sunny, cloudy, rainy, heavy rain, and thunderstorms. The preprocessing step then translated noisy, unstructured text into structured text, e.g., by deleting emails, usernames, hashtags, characters not belonging to the alphabet, and words with no meaning according to stopword dictionaries (“negative” dictionaries used to filter out words).
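The preprocessing described above can be illustrated with a minimal Python sketch. It is not the researchers' code (they worked in R): the function name `clean_tweet`, the regular expressions, and the five-word stopword set are assumptions made purely for illustration, and a real run would load a full Indonesian stopword dictionary.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the study relied on
# full stopword dictionaries that are not reproduced here.
STOPWORDS = {"yang", "di", "dan", "ini", "itu"}

def clean_tweet(text, stopwords=STOPWORDS):
    """Turn one raw tweet into a list of lowercase, alphabetic tokens."""
    text = text.lower()
    text = re.sub(r"\S+@\S+\.\S+", " ", text)  # delete email addresses
    text = re.sub(r"[@#]\w+", " ", text)       # delete usernames and hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # delete non-alphabetic characters
    # Split on whitespace and drop words found in the stopword set.
    return [tok for tok in text.split() if tok not in stopwords]

tokens = clean_tweet("@budi Hujan deras di Jakarta!! #banjir info: cuaca@example.com 123")
print(tokens)
```

The order of the substitutions matters in this sketch: emails are removed before usernames so that the `@` inside an address is not mistaken for a mention.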
Next comes the term frequency-inverse document frequency (TF-IDF) step, a feature extraction process. TF-IDF is a statistical measure that reflects the importance of a word to a document, also known as the word’s weight. Term frequency is simply the number of times a term (word) appears in a document. Inverse document frequency is a factor that diminishes the weight of terms that appear in many documents and increases the weight of terms that appear in few. The word’s weight can then be factored into the ML classification algorithms depending on the method used; e.g., the multinomial Naive Bayes classifier is well suited for classification with discrete features, such as the word’s weight.
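To make the weighting concrete, here is a small Python sketch of one common TF-IDF variant: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The summary does not say which variant the researchers used, so this formula, the function name `tfidf`, and the toy documents are all assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: weight = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

# Toy, made-up tokenized tweets (hujan = rain, cerah = sunny, panas = hot).
docs = [["hujan", "deras", "hujan"], ["hujan", "cerah"], ["cerah", "panas"]]
weights = tfidf(docs)
```

In this toy corpus, "deras" occurs in only one of three documents, so it receives a larger inverse document frequency factor than "hujan", which occurs in two. A term present in every document would receive a weight of zero under this variant.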
The goal of the classification step is to assign each tweet to one of the defined discrete classes, i.e., the weather keywords. Because several classes are sought, this is a multi-class, multi-dimensional problem. The classification methods employed by the researchers were support vector machine (SVM), multinomial Naive Bayes (MNB), and logistic regression (LR); each method has distinct characteristics that make it suitable for different applications. Because the performance of trained models is a major concern for ML practitioners, each method’s performance was assessed and compared. SVM outperformed the other two methods with a predictive accuracy of 93%; MNB and LR reached 53% and 88%, respectively.
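As an illustration of multi-class text classification and of how accuracy is computed, the sketch below implements a bare-bones multinomial Naive Bayes classifier with add-one smoothing in plain Python. It covers only one of the three methods compared in the study, works on raw word counts rather than TF-IDF weights, and uses a handful of made-up tokenized tweets; the class name `TinyMultinomialNB`, the `accuracy` helper, and all data are hypothetical. It says nothing about the accuracies reported above.

```python
import math
from collections import Counter

class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes over token lists (illustrative only)."""

    def fit(self, docs, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.log_prior, self.log_lik = {}, {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(tok for d in class_docs for tok in d)
            # Add-alpha (Laplace) smoothing avoids zero probabilities.
            total = sum(counts.values()) + alpha * len(self.vocab)
            self.log_lik[c] = {
                tok: math.log((counts[tok] + alpha) / total) for tok in self.vocab
            }
        return self

    def predict(self, doc):
        def score(c):
            # Tokens never seen during training are ignored.
            return self.log_prior[c] + sum(
                self.log_lik[c][tok] for tok in doc if tok in self.vocab
            )
        return max(self.classes, key=score)

def accuracy(model, docs, labels):
    """Fraction of documents whose predicted class matches the true class."""
    hits = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return hits / len(docs)

# Made-up training tweets, already tokenized, with hypothetical labels.
train_docs = [["hujan", "deras", "banjir"], ["hujan", "gerimis"],
              ["cerah", "panas", "terik"], ["cerah", "biru"],
              ["mendung", "awan", "gelap"], ["mendung", "awan"]]
train_labels = ["rainy", "rainy", "sunny", "sunny", "cloudy", "cloudy"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
test_docs = [["hujan", "deras"], ["awan", "gelap"], ["cerah", "terik"]]
test_labels = ["rainy", "cloudy", "sunny"]
print(model.predict(["hujan", "deras"]), accuracy(model, test_docs, test_labels))
```

The same `accuracy` helper could score any classifier that exposes a `predict` method, which is one way to compare several methods on a shared held-out set.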
Conclusions and Potential Future Applications
SVM proved to be an effective ML method for text classification where high-dimensional data may be present, such as the multi-class problem formulated and solved by the researchers in this study. The study applied the method to text extracted from the social media platform Twitter in order to obtain intra-day weather data across the islands of Indonesia. This information is especially valuable if tied to geoposition fixing, so that residents can know where the weather is occurring and how severe it is. The next logical step is to deploy such a system and gather feedback on its utility from those it is intended to serve, the Indonesian people. Beyond alerting users to hourly weather, such a system could be applied in many other sectors where hourly or even minute-scale information makes all the difference in people’s lives. An extreme example that may be fresh in people’s minds is the Uvalde school shooting, one of the deadliest school shootings in U.S. history. According to the media, the gunman had posted several “red-flag” messages on social media leading up to the assault; a similar ML system trained to search for such indicators could be used to alert authorities to a potential threat. Such technological advancements could be the catalyst needed to begin the culture shift from a society that is reactive to one that is proactive.