Tori's Capstone Journey

Preparing for the Finish Line: Strengthening SoundSoar’s Trend Prediction for the Final Presentation
Oct 26, 2024
5 min read
This month, I focused on enhancing the machine learning algorithm in my capstone project, SoundSoar. My main goal was to improve the predictive capabilities of the algorithm by integrating several new models. I successfully added Logistic Regression, SVM, Linear Discriminant Analysis (LDA), Extra Trees, and K-Nearest Neighbors (KNN). In addition, I implemented a search function that allows users to predict outcomes for songs that may not have prior popularity data, a crucial step in understanding how different songs can trend over time. Throughout this process, I ensured that I stored essential information for each model, including the starter script, README files, model pickle files, and CSV data. This makes it easier for my professors and other computer science professionals to access, use, and adapt my work as needed.
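Since the model pickle files and CSVs are meant to let professors and other developers rerun and adapt the work, here is a minimal sketch of how such artifacts might be saved and reloaded with scikit-learn and joblib. The file names, column names, and the "trend" label below are illustrative assumptions, not SoundSoar's actual layout.

```python
# Minimal sketch (assumptions noted): persist a trained model and reload it later.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed schema: a CSV of tracks with audio/popularity features and a "trend" label.
features = ["tempo", "danceability", "velocity", "current_popularity"]
df = pd.read_csv("tracks.csv")

model = RandomForestClassifier(random_state=42)
model.fit(df[features], df["trend"])

joblib.dump(model, "random_forest.pkl")      # the stored model pickle file
reloaded = joblib.load("random_forest.pkl")  # anyone with the pickle can reload it
print(reloaded.predict(df[features].head()))
```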
This month presented valuable lessons, particularly regarding the implications of relying on popularity history for model accuracy. When songs without this data are queried, models that emphasize popularity metrics struggle, leading to less reliable predictions. Currently, only the KNN and LDA models show significant feature importance for attributes like tempo and danceability. Moving forward, I recognize the need to dedicate time to enhance the models that leverage attribute-driven metrics or consider the feasibility of tracking popularity history for all Spotify tracks. By addressing these areas, I aim to improve the predictive capabilities of my project.

Model Performance Overview
This month, I utilized a variety of machine learning models to enhance the trend prediction capabilities of my project. Below is a list of the seven models implemented, along with their key features:
Logistic Regression
Predicts the probability of each trend class from a weighted combination of the input features.
Support Vector Machine (SVM)
Effective for classification tasks, particularly in high-dimensional spaces.
Linear Discriminant Analysis (LDA)
A statistical method that separates classes using linear combinations of features; also useful for feature extraction and dimensionality reduction.
Extra Trees
An ensemble method that uses a collection of decision trees to improve accuracy.
K-Nearest Neighbors (KNN)
A non-parametric method used for classification and regression, relying on distance metrics.
Random Forest
An ensemble technique that constructs multiple decision trees for robust predictions.
HistGradientBoosting
A gradient boosting method that builds trees sequentially on histogram-binned features to optimize predictive performance.
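For context, here is a minimal sketch of how these seven models might be instantiated with scikit-learn. Most settings are library defaults; the few explicit values echo the hyperparameters mentioned later in this post, and the project's full tuned configurations are not reproduced here.

```python
# Minimal sketch: the seven model types discussed above, instantiated with scikit-learn.
# Most settings are defaults; n_neighbors, min_samples_leaf, and criterion echo the
# values mentioned later in this post.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              HistGradientBoostingClassifier)
from sklearn.neighbors import KNeighborsClassifier

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "LDA": LinearDiscriminantAnalysis(),
    "ExtraTrees": ExtraTreesClassifier(min_samples_leaf=1, criterion="gini"),
    "KNN": KNeighborsClassifier(n_neighbors=21),
    "RandomForest": RandomForestClassifier(),
    "HistGradientBoosting": HistGradientBoostingClassifier(),
}
```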
The performance results below provide a visual comparison of accuracy, precision, F1 score, and other relevant metrics for these models.

The performance of the various models revealed notable results. Both RandomForest and HistGradientBoosting achieved an impressive accuracy of 0.86, effectively balancing depth, ensemble size, and regularization settings. Each of these models demonstrated consistency in predicting "down" and "stable" trends while maintaining minimal misclassifications. ExtraTrees followed closely with an accuracy of 0.83, showcasing low misclassification rates across classes. Its effective combination of parameters, including min_samples_leaf: 1 and criterion: gini, allowed it to capture subtle trend patterns. SVM and KNN also displayed competitive performance, with accuracies of 0.78 and 0.81, respectively. SVM managed complex trend separations moderately well, while KNN’s selection of n_neighbors: 21 helped in accurately identifying nearby trends. In contrast, Logistic Regression and LDA, while more interpretable, scored lower with accuracies of 0.69 and 0.70. Both models faced challenges in distinguishing between "up" and "down" classifications, highlighting their limitations in capturing the complexities of trend shifts inherent in the data.
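Accuracy, precision, and F1 numbers like those above could be computed on a held-out test split, roughly as sketched below. This continues from the models dictionary in the earlier sketch; the column names and the "trend" label remain assumptions rather than the project's actual schema.

```python
# Minimal sketch: fit each model on a training split and report the metrics
# discussed above. Continues from the `models` dict in the earlier sketch;
# column names and the "trend" label (up/down/stable) are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, f1_score

features = ["tempo", "danceability", "velocity", "current_popularity"]
df = pd.read_csv("tracks.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["trend"], test_size=0.2, stratify=df["trend"], random_state=42)

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: "
          f"acc={accuracy_score(y_test, preds):.2f}, "
          f"prec={precision_score(y_test, preds, average='weighted'):.2f}, "
          f"f1={f1_score(y_test, preds, average='weighted'):.2f}")
```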
The analysis of feature importance across the models sheds light on the key predictors influencing music trend predictions. In RandomForest, the top predictors identified were "standard deviation in popularity" and "velocity," emphasizing the significant roles that popularity consistency and song pace play in maintaining trend stability. For HistGradientBoosting, "velocity" and "current popularity" emerged as strong predictors, with the model benefiting from its ability to make nuanced, gradual improvements on these variables, thereby enhancing its trend prediction capabilities. Meanwhile, ExtraTrees highlighted the importance of "mean popularity" and "median popularity," effectively leveraging these averages to assess the stability of trend shifts. This focus on average metrics underscores how the models prioritize consistent popularity metrics in evaluating potential trends.
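For the tree ensembles, impurity-based importances are exposed directly by scikit-learn, and the sketch below shows one way to rank them. HistGradientBoosting exposes no feature_importances_ attribute, so a technique such as permutation importance (sketched after the next paragraph) would be needed for it; whether SoundSoar uses these exact approaches is an assumption on my part.

```python
# Minimal sketch: impurity-based importances for the fitted tree ensembles
# (assumes the models were fit as in the previous sketch).
import pandas as pd

features = ["tempo", "danceability", "velocity", "current_popularity"]  # assumed names
for name in ["RandomForest", "ExtraTrees"]:
    importances = pd.Series(models[name].feature_importances_, index=features)
    print(name)
    print(importances.sort_values(ascending=False).head(3))
```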

The feature importance analysis for the lower-performing models reveals distinct characteristics in how they capture trends. Support Vector Machine (SVM) exhibits a balanced importance across features, with "velocity" and "current popularity" slightly more prominent, indicating its balanced approach to capturing trend patterns. In contrast, K-Nearest Neighbors (KNN) is most influenced by "tempo" and "danceability," highlighting the model's reliance on rhythmic elements to group trend categories effectively. Logistic Regression shows that "mean popularity" and "current popularity" are the most significant factors, which aligns with its linear interpretation that emphasizes the current state and average popularity over other features. Finally, Linear Discriminant Analysis (LDA) prioritizes "velocity" and "danceability," suggesting that this model leans towards energy-related audio features when distinguishing between trend categories. This analysis provides valuable insight into how different models prioritize various musical elements and popularity metrics in their predictions.
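SVM, KNN, Logistic Regression, and LDA expose no built-in importance scores, so one common way to obtain rankings like those above is permutation importance. The sketch below illustrates the idea; it is an assumed approach rather than necessarily the one used in SoundSoar (the coefficients of Logistic Regression and LDA could also be inspected directly).

```python
# Minimal sketch: permutation importance for models without feature_importances_
# (assumes models, X_test, y_test, and features from the earlier sketches).
from sklearn.inspection import permutation_importance

for name in ["SVM", "KNN", "LogisticRegression", "LDA"]:
    result = permutation_importance(models[name], X_test, y_test,
                                    n_repeats=10, random_state=42)
    ranked = sorted(zip(features, result.importances_mean),
                    key=lambda pair: pair[1], reverse=True)
    print(name, ranked[:3])
```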

The high-performing models, including RandomForest, HistGradientBoosting, and Support Vector Machine (SVM), consistently achieve top accuracy and F1 scores, underscoring their effectiveness in leveraging popularity metrics for trend prediction. In the early stages of research, the availability of popularity data was limited, resulting in stable performance across all models. However, as the dataset expanded, those models prioritizing popularity-based features continued to maintain high accuracy and precision. On the other hand, lower-performing models like Linear Discriminant Analysis (LDA) and K-Nearest Neighbors (KNN) do not significantly lag behind, yet their reliance on audio features such as "danceability" and "tempo" rather than popularity metrics limits their ability to effectively capture trend shifts. Overall, the consistent outperformance of models utilizing popularity-based parameters indicates that fluctuations in popularity are a strong driver of trend prediction in this context.

In addition to refining the machine learning models, I also concentrated on enhancing the project’s template work, specifically the new trend review page and the active model page. These pages are designed to give users comprehensive insight into model performance, featuring key metrics such as accuracy, precision, and F1 scores, alongside detailed feature importance analysis. This streamlined presentation not only improves the user experience but also makes it easier to access crucial data for further evaluation.
To explore these updates, please check out the active model page (https://soundsoar.com/trending/model/info/) and the trend review page (https://soundsoar.com/trending/review/).
Retrospective
This month has been a pivotal period in the development of SoundSoar, allowing me to implement various machine learning models and significantly enhance the predictive capabilities of the algorithm. One of the standout achievements was the successful integration of multiple models, including Logistic Regression, SVM, LDA, Extra Trees, and KNN. This not only enriched the project but also provided diverse approaches to trend prediction, highlighting the effectiveness of popularity metrics. The implementation of the search function for songs lacking prior popularity data marked a crucial milestone in addressing gaps in model performance, showcasing my ability to adapt and find solutions to challenges.
However, I faced challenges regarding the reliance on popularity history for model accuracy. As I delved deeper into the project, I realized that the absence of popularity data for certain songs led to less reliable predictions. This highlighted a limitation in the models that prioritize popularity metrics, as their performance suffered without adequate data. Moving forward, I recognize the importance of dedicating time to enhance attribute-driven models, such as KNN and LDA, or exploring the feasibility of tracking popularity history for all Spotify tracks. This reflection emphasizes the need for continuous improvement and adaptation in the face of challenges, ultimately contributing to the project's success.
Throughout this course, I have effectively utilized my time by setting clear goals and actively working toward them. The skills I developed in previous courses, particularly in machine learning and data analysis, have been instrumental in shaping my approach to SoundSoar. Regular interactions with my advisor provided valuable insights, guiding me through complex decisions and enhancing the quality of my work. As I prepare for the presentation phase of my capstone project, I feel confident that I have built a solid foundation and am ready to showcase the progress I have made.