Data mining my Spotify history

Published April 29, 2021 on Chandler Swift's Blog

I have a few songs I listen to very frequently. Enough that I wonder, “Does the amount I stream this song have a noticeable effect on the song’s popularity?” It turns out, yes, there are at least a handful of songs for which that is the case.

First, the main findings:

Song | Listens/Total | Percentage of total
--- | --- | ---
Nun komm, der Heiden Heiland, BWV 659 | 67/<1000 [1] | 6.700%
Come To Mama | 73/1118 | 6.530%
10 Pièces - Organ: Scherzo | 68/1156 | 5.882%
Dandaya | 83/1753 | 4.735%
How Long Does It Take? | 44/<1000 [1] | 4.400%
The Eye of the Hurricane | 42/<1000 [1] | 4.200%

Neat! I have six (and possibly more [1] [2]) songs for which I account for at least 4% of the Spotify listens. Interestingly, none of these are among my top played tracks, some of which have hundreds of plays.

That said, there are two ways to get a high listen percentage: have my personal total be very high, or have the global total be very low. Given this, it’s not particularly surprising that a large number of the songs for which I have a high listen percentage have a very low total number of listens.
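The percentage itself is simple arithmetic. Here is a minimal sketch of it, assuming (as footnote 1 describes) that a track whose count Spotify hides is treated as having exactly 1000 plays. The function name `listen_share` and the `floor` parameter are made up for illustration; hidden counts are represented as 0 here.

```python
def listen_share(my_plays, global_plays, floor=1000):
    """Return my plays as a percentage of all plays.

    Global counts below `floor` (including hidden counts, passed as 0)
    are rounded up to `floor`, so the result is a conservative lower bound.
    """
    return 100 * my_plays / max(global_plays, floor)


print(f"{listen_share(73, 1118):.3f}%")  # Come To Mama -> 6.530%
print(f"{listen_share(67, 0):.3f}%")     # hidden count, treated as 1000 -> 6.700%
```

Both levers are visible in the formula: raising `my_plays` or lowering `global_plays` increases the share, but the floor caps how much a tiny global count can help.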


The full source code I wrote and used is available on GitHub.

I have my listening history mostly recorded on Last.fm, and Spotify provides total listen counts for songs, at least through their web client. Here’s an abbreviated runthrough of the code I used to find the data, ignoring most of the error handling and boilerplate.

#!/usr/bin/env python3

import pylast
import requests  # used below to query the local play-count service
import spotify.sync as spotify

Python has nice packages to do the heavy lifting here: pylast for Last.fm and the spotify package for Spotify.

lastfm_top_tracks = lastfm_client.get_user("chandlerswift").get_top_tracks(stream=True, limit=None)

First, we get a generator that returns (lazily – no need to pull all 11000ish of my listens in one go!) the tracks I’ve played the most frequently, in descending order. (If you want to try this, note that limit=None does require pylast/pylast#367 to be merged.)
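To illustrate what "lazily" buys us, here is a small sketch of a paginated generator. It is not pylast's implementation; `iter_top_tracks` and `fake_fetch_page` are hypothetical stand-ins, with the fake fetcher playing the role of a network call that returns one page of results at a time.

```python
from itertools import islice


def iter_top_tracks(fetch_page, page_size=50):
    """Yield items one at a time, requesting a new page only when needed."""
    page = 1
    while True:
        items = fetch_page(page, page_size)
        if not items:
            return  # an empty page means we have run out of results
        yield from items
        page += 1


fetched_pages = []  # records which pages were actually requested


def fake_fetch_page(page, page_size):
    """Pretend backend holding 120 tracks in total."""
    fetched_pages.append(page)
    start = (page - 1) * page_size
    return [f"track {n}" for n in range(start, min(start + page_size, 120))]


first_three = list(islice(iter_top_tracks(fake_fetch_page), 3))
print(first_three, fetched_pages)
```

Taking only the first three items requests a single page; the remaining pages are never fetched unless the loop keeps consuming the generator.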

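for i, lastfm_track in enumerate(lastfm_top_tracks):
    # NOTE: the call name and the artist expression were lost from this listing;
    # spotify_client.search(...) and lastfm_track.item.artist are a best-guess reconstruction.
    spotify_track = spotify_client.search(
        f"{lastfm_track.item.artist} {lastfm_track.item.title}", types=["track"], limit=1
    ).tracks[0]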

Then, for each track we’ve retrieved from Last.fm, we find its corresponding track on Spotify. This can be issue-prone: tracks don’t necessarily have identical names between services, and often the same artist will have the same track across many albums. Apparently Last.fm considers these to be the same, while Spotify counts them separately. Despite these problems, this generally seemed to work well.

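    # NOTE: the album ID expression was lost from this listing; spotify_track.album.id is a best-guess reconstruction.
    res = requests.get(f"http://localhost:8080/albumPlayCount?albumid={spotify_track.album.id}").json()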

Here’s some magic! It turns out Spotify doesn’t provide a way to retrieve play count information from their API, so we have to use a third party tool. (I could likely have figured out what calls this tool makes, exactly, and integrated it, but that seemed more complex than integrating an extra tool into a one-off workflow.) I downloaded a .jar file from sp-playcount-librespot’s latest release, ran it, and directed my API requests there.

    found_track = None
    for spotify_disc in res['data']['discs']:
        for spotify_track_info in spotify_disc['tracks']:
            if spotify_track_info['name'].lower() == lastfm_track.item.title.lower():
                found_track = spotify_track_info

We end up having to extract the Spotify track from that album, since it doesn’t seem to be possible to get data for an individual track from sp-playcount-librespot. This naïve comparison did have some issues, but for the most part it worked fairly well.
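One way to make the comparison less naïve is to normalize both titles before comparing them. The sketch below is not what the script did; `normalize_title` is a hypothetical helper that strips accents, punctuation, and a trailing "- Remastered 20xx"-style suffix, which covers the two kinds of mismatch mentioned later in this post. The suffix pattern is an assumption about how such titles are usually written and will not catch every variant.

```python
import re
import unicodedata

# Matches suffixes such as " - Remastered 2009" or " - 2013 Remastered Version".
_REMASTER_SUFFIX = re.compile(
    r"\s+-\s+(?:\d{4}\s+)?remaster(?:ed)?(?:\s+version)?(?:\s+\d{4})?\s*$",
    re.IGNORECASE,
)


def normalize_title(title):
    """Reduce a track title to lowercase, unaccented, punctuation-free words."""
    title = _REMASTER_SUFFIX.sub("", title)
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", title)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter or digit with a space.
    words_only = "".join(ch if ch.isalnum() else " " for ch in unaccented)
    return " ".join(words_only.casefold().split())


print(normalize_title("¿Quién Será?") == normalize_title("Quien Sera?"))
print(normalize_title("Let It Be - Remastered 2009") == normalize_title("Let It Be"))
```

The trade-off is that aggressive normalization can also merge titles that really are different recordings, so a manual pass over the leftovers is still useful.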

    if found_track:
        found_track['my_playcount'] = lastfm_track.weight
        print(f"{i}. {found_track['name']}: {found_track['my_playcount']}/{found_track['playcount']} ({100*found_track['my_playcount']/found_track['playcount']:.3f}%)")
    else:
        # NOTE: the album expression was lost from this listing; spotify_track.album.name is a best-guess reconstruction.
        print(f"No track {lastfm_track.item.title} found on album {spotify_track.album.name} (will find later)")

To each track, we tack on its “weight” (play count) from Last.fm, and save it for later. We also make a note of songs that we couldn’t find on the album; I’ll clean those up later. In the end, I wound up effectively running through this whole process again on the initially failed tracks, with a manual track comparison instead of an automatic one.


At this point, we’re done gathering the data; let’s see what we have from it. I ran the script with ipython -i, so when I was done, it just dropped me into an ipython shell to manipulate the data as I wanted. I sorted it by percentages, and printed them out in the format used to generate the table at the beginning of the article:

track_data.sort(key=lambda track: track['my_playcount']/max(track['playcount'], 1000), reverse=True)
for i, track in enumerate(track_data[:50]):
    print(f"[{track['name']}]({track['uri'].split(':')[2]}) | {track['my_playcount']}/{track['playcount'] if track['playcount'] > 0 else '<1000'} | {track['my_playcount']/max(track['playcount'], 1000)*100:.3f}%")
[Nun komm, der Heiden Heiland, BWV 659]( | 67/<1000 | 6.700%
[Come To Mama]( | 73/1118 | 6.530%
[10 Pièces - Organ: Scherzo]( | 68/1156 | 5.882%
[Dandaya]( | 83/1753 | 4.735%
[How Long Does It Take?]( | 44/<1000 | 4.400%
[The Eye of the Hurricane]( | 42/<1000 | 4.200%
[It's A Shame, It's A Mystery]( | 63/1757 | 3.586%
[Behind My Back]( | 36/1095 | 3.288%
[Lucky Southern (Live)]( | 27/<1000 | 2.700%
[Pussy Cat Moan]( | 80/3103 | 2.578%
[You're Nobody 'Til Somebody Loves You]( | 24/<1000 | 2.400%
[Imagine]( | 29/1351 | 2.147%
[Come Along and Join Me]( | 20/<1000 | 2.000%
[Sweet Inspirations]( | 23/1175 | 1.957%
[Soul Shine]( | 57/2954 | 1.930%
[How High the Moon]( | 18/<1000 | 1.800%
[I'm Happy With Me]( | 18/<1000 | 1.800%
[I Want To Be Happy]( | 18/<1000 | 1.800%
[I Don't Want To Hurt You Baby]( | 52/3042 | 1.709%
[You Gotta Move]( | 54/3374 | 1.600%
[To Dream The Impossible Dream]( | 15/<1000 | 1.500%
[Mannenberg - Pts. 1 & II (Feat. Sons of Table Mountain)]( | 37/2471 | 1.497%
[Jump Blues Jam Track in D_160 bpm]( | 38/2670 | 1.423%
[You'll Never Walk Alone]( | 18/1267 | 1.421%
[All Things Are Possible]( | 14/<1000 | 1.400%
[Gospel Beat]( | 14/<1000 | 1.400%
[Business is Tough (in Db)]( | 183/13218 | 1.384%
[Schefel]( | 14/1047 | 1.337%
[If I Only Had a Brain]( | 37/2792 | 1.325%
[Milestones]( | 13/<1000 | 1.300%
[Let It Be]( | 17/1338 | 1.271%
[Centerpiece]( | 20/1590 | 1.258%
[Hero]( | 11/<1000 | 1.100%
[Why Did You Leave My Child?]( | 11/<1000 | 1.100%
[A Chance To Breathe]( | 45/4325 | 1.040%
[Live In The Spirit]( | 10/<1000 | 1.000%
[Jesus, Oh What a Wonderful Child (In the Style of Mariah Carey) [Karaoke Version]]( | 23/2407 | 0.956%
[Songs of Praise Toccata for Organ]( | 26/2829 | 0.919%
[Bye Bye Blackbird]( | 9/<1000 | 0.900%
[Leave the Door Open]( | 28/3226 | 0.868%
[I Wish]( | 17/2041 | 0.833%
[The Walking Wounded]( | 26/3175 | 0.819%
[Suite brève: IV. Dialogue sur les mixtures]( | 8/<1000 | 0.800%
[It Had To Be You]( | 8/<1000 | 0.800%
[Atlanta Blue]( | 8/<1000 | 0.800%
[I'm A Woman]( | 34/4500 | 0.756%
[Honey It's Your Fault]( | 28/3862 | 0.725%
[Bathtub Blues]( | 7/<1000 | 0.700%
[Cookin' At The Colonels]( | 7/<1000 | 0.700%
[Let's Have a Natural Ball]( | 7/<1000 | 0.700%

A few more questions: Of the songs I’ve listened to at least 5 times, how many have <1000 listens?

[f"{i}. {t['name']} by {', '.join([a['name'] for a in t['artists']])}" for i, t in enumerate(list(filter(lambda track: track['playcount'] == 0, track_data)))]
['0. Nun komm, der Heiden Heiland, BWV 659 by Johann Sebastian Bach, Matti Hannula',
 '1. How Long Does It Take? by Sista Monica Parker',
 '2. The Eye of the Hurricane by D Squared',
 '3. Lucky Southern (Live) by Thirteen Degrees',
 "4. You're Nobody 'Til Somebody Loves You by Swingin' Fireballs",
 '5. Come Along and Join Me by The Chancellors Quartet',
 '6. How High the Moon by Les DeMerle, Bonnie Eisele',
 "7. I'm Happy With Me by Sista Monica Parker",
 '8. I Want To Be Happy by The Carl Fontana - Arno Marsh Quintet',
 '9. To Dream The Impossible Dream by Sista Monica Parker',
 '10. All Things Are Possible by Sista Monica Parker',
 '11. Gospel Beat by Sista Monica Parker',
 '12. Milestones by The Carl Fontana - Arno Marsh Quintet',
 '13. Hero by Sista Monica Parker',
 '14. Why Did You Leave My Child? by Sista Monica Parker',
 '15. Live In The Spirit by Sista Monica Parker',
 '16. Bye Bye Blackbird by The Carl Fontana - Arno Marsh Quintet',
 '17. Suite brève: IV. Dialogue sur les mixtures by Jean Langlais, John Balka',
 '18. It Had To Be You by The Carl Fontana - Arno Marsh Quintet',
 '19. Atlanta Blue by Bill Walker, The Bill Walker Orchestra',
 '20. Bathtub Blues by Joe Scruggs',
 "21. Cookin' At The Colonels by Steve Einerson",
 "22. Let's Have a Natural Ball by The Blue In Blues",
 '23. Now the Green Blade Rises (arr. P. Manz for pipe organ) by J. M. C. Crum, Paul Manz',
 '24. Too Many Drivers at the Wheel by AJ Crawdaddy',
 '25. Singet frisch und wohlgemut op. 12,4 - II. by Hugo Distler, MonteverdiChor Muenchen, Konrad von Abel',
 '26. Peas Porridge Hot by Joe Scruggs',
 '27. Organ Symphony No. 1 in D Major, Op. 14: VI. Final by Louis Vierne, Fabien Chavrot',
 '28. Old Devil Moon by Michael Gott',
 "29. It's a Beautiful Day in the Neighborhood by Rich Szabo, Curtis McKonly, Mark Vinci, Bill Kirschner",
 '30. Dr. Mlk & Obama Impossible Dream Tribute by Sista Monica Parker',
 '31. Old Devil Moon by Michael Gott']

Only 32 of the 1989 tracks we’re inspecting; sounds like my tastes aren’t too obscure!

Spotify gives me the length of each song. Which songs have I listened to for the longest total time?

import datetime

track_data.sort(key=lambda track: track['my_playcount'] * track['duration'], reverse=True)
for i, track in enumerate(track_data[:10]):
    print(f"{i}. {track['name']}: {track['my_playcount']} listens at {datetime.timedelta(seconds=round(track['duration']/1000))}: {datetime.timedelta(seconds=round(track['my_playcount'] * track['duration'] / 1000))}")
0. You Look Good To Me: 690 listens at 0:04:52: 2 days, 7:58:46
1. Lucky Southern: 649 listens at 0:03:46: 1 day, 16:45:00
2. Change the World: 569 listens at 0:03:55: 1 day, 13:07:19
3. Strasbourg / St. Denis: 473 listens at 0:04:39: 1 day, 12:36:30
4. Rock of Ages: 306 listens at 0:05:28: 1 day, 3:52:58
5. At Long Last Love - Live: 282 listens at 0:04:56: 23:12:39
6. Sultans of Swing: 229 listens at 0:05:50: 22:17:22
7. Mo' Better Blues (feat. Terence Blanchard): 281 listens at 0:03:39: 17:05:31
8. Here We Go Again: 231 listens at 0:03:58: 15:16:18
9. Love Me or Leave Me - 2013 Remastered Version: 270 listens at 0:03:21: 15:06:15

It looks like this list is still roughly in order of listen count, as one might expect. Most songs tend to be around the same length, so it makes sense that it wouldn’t be wildly different from the most-listened-to song list.

I did limit the script run to songs with at least 5 plays; however, a song with only 4 plays would need to be almost 4 hours long to make this list, so I think it’s safe to ignore those.
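The "almost 4 hours" figure can be checked with a couple of lines, using the tenth-place total from the output above (15:06:15). The variable names are just for illustration.

```python
import datetime

# Total listening time of the last entry in the top-10 list above.
tenth_place_total = datetime.timedelta(hours=15, minutes=6, seconds=15)

# Tracks excluded from the run have at most 4 plays.
max_excluded_plays = 4

# Minimum track length an excluded song would need to reach tenth place.
required_length = tenth_place_total / max_excluded_plays
print(required_length)  # 3:46:33.750000
```

So an excluded track would need to run about 3 hours 47 minutes to tie for tenth place.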

Potential improvements

A solution to many of the issues I’ve been having would be to use Spotify’s data on my listening history rather than Last.fm’s.

  • Because I’d be correlating Spotify’s data with Spotify’s data, I wouldn’t have to worry about track title mismatches (the most notable being “¿Quién Será?” vs “Quien Sera?”; the most common being “[track title]” vs “[track title] - Remastered 20xx”). In this case, I could simply match the internal Spotify IDs (if that’s something their data dumps do actually provide).
  • Spotify should have my complete, precise listening history, compared to Last.fm, which only has the last few years’ worth.
  • I could select each track by ID rather than through Spotify’s search, which occasionally returns an obscure track on a weird album that isn’t what I’m looking for.
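As a rough sketch of what ID-based matching could look like: the record layout below is entirely hypothetical (whether Spotify's export includes track URIs is exactly the open question above), and the IDs are placeholders. The only detail taken from this post is the `spotify:track:<id>` URI shape that the earlier `split(':')[2]` relies on.

```python
from collections import Counter

# Hypothetical export rows: one dict per stream, each with a track URI.
streaming_history = [
    {"track_uri": "spotify:track:AAA"},
    {"track_uri": "spotify:track:BBB"},
    {"track_uri": "spotify:track:AAA"},
]

# Hypothetical global play counts, keyed by the same track IDs.
global_playcounts = {"AAA": 1118, "BBB": 1753}

# Count my plays per track ID; no title comparison is needed.
my_playcounts = Counter(row["track_uri"].split(":")[2] for row in streaming_history)

# Join the two data sets on ID and compute my share of each track's plays.
shares = {
    track_id: 100 * plays / global_playcounts[track_id]
    for track_id, plays in my_playcounts.items()
    if track_id in global_playcounts
}
print(shares)
```

Because both sides are keyed by the same identifier, accents, remaster suffixes, and duplicate album entries stop mattering.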

I did want to base this analysis on Spotify’s data, but I wasn’t particularly patient, and Spotify seems to try to barely scrape under GDPR’s 30-day deadline for delivering data exports.

And finally, this isn’t an improvement, but worth mentioning: While reading through Spotify’s API docs, I found that they expose an Audio Analysis for a Track (if the link doesn’t take you there, reload or search the page for “Get Audio Analysis for a Track” – anchors don’t always seem to work on the first page load), which includes information like time and key signature! This is probably enough data that I’d be able to write the music player feature of which I’ve long been dreaming: Create playlists with no key changes between songs!

Oh man, let me tell you all about it! Any song has a starting key signature and an ending key signature. For many songs, these are the same. Those are the easy ones. Start with a song in the key of D, say, and play more songs in the key of D. Here's an example from the last time I seriously thought about this:

The hard part comes in when songs have one or more modulations (changes in key signature, essentially) in the middle. As a relatively quick example, you could run through the following songs:

Track | Artist | Album | Start Key | End Key
--- | --- | --- | --- | ---
This Will Be (An Everlasting Love) | Natalie Cole | Inseparable | B♭ | D♭
I Just Called To Say I Love You - Live/1995 | Stevie Wonder | The Complete Stevie Wonder | D♭ | E♭
You’re Nobody ‘Til Somebody Loves You | Swingin’ Fireballs | Live in Bremen | E♭ | G
Beyond the Sea (La Mer) | George Benson | 20/20 | G | A♭
My Buick, My Love and I - Bonus Track | Seth MacFarlane, Elizabeth Gillies | In Full Swing | A♭ | F
The Liberty Bell March | John Philip Sousa | Last Night of the Proms | F | B♭

Each one begins in the key the previous one left off in, creating what is (for me) a relatively seamless listening experience! I had abandoned this project years ago when I figured I’d have to manually classify every song I wanted to listen to, but if Spotify will do it for me…problem successfully avoided!

My ideal listening client, then, would be an improvement on the standard “shuffle” functionality. Instead of picking songs completely at random, it would pick song n+1 from the set of songs that begin in the same key that the song n ended in.
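Here is a minimal sketch of that selection rule, assuming each track's start and end keys are already known (how to derive them from Spotify's audio analysis is not covered here). The `Track` tuple and `key_matched_shuffle` are hypothetical names, and the library is just the six songs from the table above.

```python
import random
from collections import namedtuple

Track = namedtuple("Track", "title start_key end_key")

LIBRARY = [
    Track("This Will Be (An Everlasting Love)", "B♭", "D♭"),
    Track("I Just Called To Say I Love You - Live/1995", "D♭", "E♭"),
    Track("You're Nobody 'Til Somebody Loves You", "E♭", "G"),
    Track("Beyond the Sea (La Mer)", "G", "A♭"),
    Track("My Buick, My Love and I - Bonus Track", "A♭", "F"),
    Track("The Liberty Bell March", "F", "B♭"),
]


def key_matched_shuffle(tracks, first, rng=random):
    """Yield tracks so each one starts in the key the previous one ended in."""
    remaining = [t for t in tracks if t is not first]
    current = first
    yield current
    while True:
        # Song n+1 is drawn at random from the unplayed songs in the right key.
        candidates = [t for t in remaining if t.start_key == current.end_key]
        if not candidates:
            return  # nothing left that starts in the current key
        current = rng.choice(candidates)
        remaining.remove(current)
        yield current


playlist = list(key_matched_shuffle(LIBRARY, LIBRARY[0]))
for track in playlist:
    print(f"{track.start_key} -> {track.end_key}: {track.title}")
```

With this tiny library each key has exactly one candidate, so the result is the table's order. With a real library the random choice among candidates is what keeps it a shuffle, and a real client would need a fallback for when no candidate exists.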

  1. Spotify appears to not show listen counts for songs with fewer than 1000 plays. It’s not impossible that some of the more obscure things I listen to (artists I know in person, or obscure organ recordings, likely) have only tens to hundreds of plays. However, unless I can find out otherwise, I’m considering them conservatively, as if each track without a count had exactly one thousand listens. ↩︎

  2. I only signed up for Last.fm in 2017, so this won’t account for any listening I did before then. Also, for a while, I’m not convinced the integration was very good, so I think a fair number of my listens got skipped for the first year. ↩︎

I don't have a formal commenting system set up. If you have questions or comments about anything I've written, send me an email and I'd be delighted to hear what you have to say!