Warning: this is a data nerd post!

I met with a good friend yesterday to discuss various challenges we’ve been facing in strive.ai.  He’s an uber experienced data rock star from my previous company.  One of the things we discussed was using machine learning model evaluation tools not just in development but in production.  Model evaluation is essentially testing the accuracy of the predictions a model yields.

We did this in the very early days, when we had fewer than 5 users banging on the system.  We did it to test which type of algorithm we should use in each specific situation.  Once we felt satisfied, we implemented the best one(s), but we didn’t consider running a test each time we train the models in the actual flow of the application as it gets new users and new data.

What I was looking for in our discussion was a way to tell when an algorithm is ready to predict accurately, in hopes of avoiding false predictions while a user’s data is still improving over time.  In some tests this morning I was able to establish that yes, the model evaluation scores for our new users aren’t up to snuff, and as a result we’re sending alerts we could be avoiding.  Instead, we could train models more frequently for users with poor model scores until they are sufficiently accurate, and train less frequently for users with high model scores to save performance and hits against Strava.
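For the data nerds, here is a minimal sketch of that idea in Python.  The threshold and the retrain intervals are made-up numbers for illustration, and the function names are hypothetical; none of this is our production code.

```python
from datetime import timedelta

# Hypothetical values: the real threshold and intervals are not decided yet.
ACCURACY_THRESHOLD = 0.8
FAST_RETRAIN = timedelta(hours=6)   # users whose model scores poorly
SLOW_RETRAIN = timedelta(days=7)    # users whose model already scores well


def retrain_interval(model_score, threshold=ACCURACY_THRESHOLD):
    """Return how long to wait before retraining a user's model."""
    if model_score < threshold:
        return FAST_RETRAIN
    return SLOW_RETRAIN


def should_send_alert(model_score, threshold=ACCURACY_THRESHOLD):
    """Only alert when the model behind the prediction has scored well enough."""
    return model_score >= threshold
```

A new user whose model scores 0.55 would be retrained on the fast schedule and would get no alerts until the score crosses the threshold.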

The other thing we can do is test all of the models used in our ensemble situations (when we use more than one prediction to be sure) and only activate the ones that have a high accuracy score.  This will allow the system to customize itself to each user, since most of the actual models are built on a per-user basis using their activity or threshold data.  Soon we’ll be building models on a per-segment basis, which will be even more important to validate since we’ll be attempting to predict performance based on weather, power, etc.
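Roughly, the gating could look like the sketch below.  The model names, the 0.8 threshold, and the majority vote used to combine the surviving predictions are all assumptions for illustration; we have not settled on how to pick the best prediction yet.

```python
from collections import Counter


def active_models(scores, threshold=0.8):
    """Return the names of models whose evaluation score meets the threshold.

    scores: dict mapping a (hypothetical) model name to its evaluation score.
    """
    return sorted(name for name, score in scores.items() if score >= threshold)


def ensemble_predict(predictions, scores, threshold=0.8):
    """Combine predictions from active models only, by simple majority vote.

    Returns None when no model is accurate enough to be trusted.
    """
    votes = [predictions[name]
             for name in active_models(scores, threshold)
             if name in predictions]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]


# Example: the low-scoring "tree" model is ignored for this user.
scores = {"knn": 0.90, "tree": 0.60, "svm": 0.85}
predictions = {"knn": "road bike", "tree": "mountain bike", "svm": "road bike"}
print(ensemble_predict(predictions, scores))  # → road bike
```

Because the scores are computed per user, the same model type can be active for one user and switched off for another.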

Bottom line: adding real-time model evaluation is a milestone pivot for our app’s architecture.   The change will come in two steps:  We’ll first start to baseline the accuracy of every user-specific model in the system, and from there we’ll choose an accuracy threshold by which models will be turned on / off, sometime next week once we can assess the results.  It won’t be too hard to implement given the good choices we’ve made thus far, and with luck we’ll have step one in place by the end of the weekend.  If you’re interested in how we evaluate models (in this case classification models) you can click on the image below to visit a definition of precision and recall.
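If you would rather see it as code, precision and recall for a classifier can be computed from true and predicted labels like this.  It is a plain-Python sketch of the standard definitions, not our actual evaluation code, and the function name is just for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one class of a classifier.

    precision = TP / (TP + FP): of everything predicted positive, how much was right.
    recall    = TP / (TP + FN): of everything actually positive, how much was found.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# One true positive, one false positive, one false negative.
print(precision_recall([1, 1, 0, 0], [1, 0, 1, 0]))  # → (0.5, 0.5)
```

Baselining, step one above, would amount to running something like this for every user-specific model on data held back from training and recording the results.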

Happy Friday and thanks for continuing to help make this a cool Strava app!

[Image: Precision and Recall]

5 thoughts on “Precision and Recall”

  • June 25, 2017 at 3:13 pm

    Cool stuff dude!

    • June 29, 2017 at 4:21 pm

      Thanks Craig! Glad to hear you like it!

  • June 28, 2017 at 9:09 am

    Does this mean that various models would be trained and tested on each user, and different accuracy measures may be given to different users since they have been tested on different models?

    • June 29, 2017 at 4:20 pm

      Ian, in short, yes. Many of the algorithms are unique to each user (e.g. Bike Classifiers) and soon there will be multiple for each Strava segment. We’re experimenting with the quality thresholds for each type of estimator now and will dial them in as we see what level works best for each. The quality threshold at present is set on a per-algorithm basis, so everyone shares the same quality threshold for the same predictor class. As of today, only algorithms that have tested past the threshold for that type of prediction will trigger for you in the system. We’re still trying to establish the best way to decide which prediction is the best one when we attempt multiple for a specific purpose. This is called ensemble learning and we should be able to get the machine to make the best judgement in the coming weeks. Thanks for helping us get this tight!
