Warning: this is a data-nerd post!
I met with a good friend yesterday to discuss various challenges we’ve been facing with strive.ai. He’s an uber-experienced data rock star from my previous company. One of the things we discussed was using machine learning model evaluation tools not just in development but in production. Model evaluation is essentially testing the accuracy of the predictions a model yields.
We did this in the very early days, when we had fewer than five users banging on the system, to test which type of algorithm we should use in each specific situation. Once we were satisfied, we implemented the best one(s) and didn’t consider re-testing each time we train the models in the actual flow of the application as it gains new users and new data.
What I was looking for in our discussion was a way to tell when an algorithm was ready to predict accurately, in hopes of avoiding false predictions while a user’s data improves over time. In some tests this morning I was able to establish that yes, the model evaluation scores for our new users aren’t up to snuff, and as a result we’re sending alerts we could be avoiding. Instead, we can train models more frequently for users with poor model scores until they’re sufficiently accurate, and less frequently for users with high model scores to save performance and hits against Strava.
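The idea above — retrain struggling models often, accurate ones rarely — could be sketched roughly like this. All the names and numbers here (the function, the 0.8 cutoff, the hour intervals) are illustrative assumptions, not the actual strive.ai code:

```python
def training_interval_hours(eval_score: float, threshold: float = 0.8) -> int:
    """Pick a retraining cadence from a model's evaluation score.

    Models below the (hypothetical) accuracy threshold get retrained
    every 6 hours until they improve; models above it only every 48,
    which saves compute and API hits against Strava.
    """
    return 6 if eval_score < threshold else 48

# A new user's model scoring poorly gets the aggressive schedule:
new_user_interval = training_interval_hours(0.55)      # frequent retraining
veteran_interval = training_interval_hours(0.93)       # infrequent retraining
```

The same score could also gate alerting, so predictions from a below-threshold model never reach the user in the first place.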
The other thing we can do is test all of the models used in our ensemble situations (when we combine more than one prediction to be sure) and only activate the ones with high accuracy scores. This will allow the system to customize itself to each user, since most of the models are built on a per-user basis using their activity or threshold data. Soon we’ll be building models on a per-segment basis, which will be even more important to validate since we’ll be attempting to predict performance based on weather, power, etc.
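Filtering an ensemble down to its trustworthy members is a simple gate. A minimal sketch, assuming a hypothetical per-user mapping of model names to evaluation scores:

```python
def active_models(model_scores: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Return the names of per-user models whose evaluation score
    clears the threshold; only these contribute to the ensemble."""
    return [name for name, score in model_scores.items() if score >= threshold]

# Hypothetical scores for one user's models:
scores = {"activity_model": 0.91, "threshold_model": 0.62, "segment_model": 0.85}
usable = active_models(scores)  # threshold_model sits out until it improves
```

Because the gate runs per user, two users with identical model sets can end up with different active ensembles — which is exactly the self-customization described above.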
Bottom line: adding real-time model evaluation is a milestone pivot for our app’s architecture. The change will come in two steps: first we’ll baseline the accuracy of every user-specific model in the system, and from there we’ll choose an accuracy threshold by which models are turned on/off, sometime next week once we can assess the results. It won’t be too hard to implement given the good choices we’ve made thus far, and with luck we’ll have step one in place by the end of the weekend. If you’re interested in how we evaluate models (in this case classification models), you can click on the image below to visit a definition of precision and recall.
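For the curious, precision and recall fall straight out of the confusion counts: precision = TP / (TP + FP) asks "of the alerts we sent, how many were right?", while recall = TP / (TP + FN) asks "of the alerts we should have sent, how many did we catch?". A quick worked sketch (the counts are made up for illustration):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from true positives, false
    positives, and false negatives, guarding the zero-denominator case."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Say a user's model produced 8 correct alerts, 2 spurious ones,
# and missed 4 it should have fired:
p, r = precision_recall(tp=8, fp=2, fn=4)  # p = 0.8, r ≈ 0.667
```

For the alert-spam problem described earlier, precision is the score to watch: a low-precision model is exactly the one sending alerts we could be avoiding.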
Happy Friday and thanks for continuing to help make this a cool Strava app!