Skip to main content

Pandas download statistics, PyPI and Google BigQuery - Daily downloads and downloads by latest version

Inspired by this blog post : https://langui.sh/2016/12/09/data-driven-decisions/, I wanted to play around with Google BigQuery myself. And the blog post is pretty awesome because it has sample queries. I mix and matched the examples mentioned on the blog post, intent on answering two questions - 

1. How many people download the Pandas library on a daily basis? Actually, if you think about it, it's more of a question of how many times was the pandas library downloaded in a single day, because the same person could've downloaded multiple times. Or a bot could've.

This was just a fun first query/question.

2. What is the adoption rate of different versions of the Pandas library? You might have come across similar graphs which show the adoption rate of various versions of Windows.

Answering this question is actually important because the developers should have an idea of what the most popular versions are, see whether or not users are adopting new features/changes they provided in a new version, and so on.

https://bigquery.cloud.google.com/dataset/the-psf:pypi is where you can explore the metadata and enter/run the queries you are interested in.

Daily download stats

The query is


SELECT
  STRFTIME_UTC_USEC(timestamp, "%Y-%m-%d") AS yyyymmdd,
  COUNT(*) as total_downloads,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -61, "day"),
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "day")
  )
WHERE
  file.project = 'pandas'
GROUP BY
  yyyymmdd
ORDER BY
  total_downloads DESC

For those of you not familiar with the SQL-like syntax, the above query creates a new table which contains the dates in one column and the total number of downloads in the other. The download statistics pertain to the Pandas library. And we asked for 60 days worth of data.

After running the query, downloading the data set and plotting, you should see something similar to



If i'm not wrong, the regular dips are on the weekends. There are atleast two weeks which have strong dips mid-week. I haven't explored the dataset further but I wonder why.

Adoption rate of new updates/releases

The query is


SELECT
  STRFTIME_UTC_USEC(timestamp, "%Y-%m") AS yyyymm,
  file.version,
  COUNT(*) as total_downloads,
FROM
  TABLE_DATE_RANGE(
    [the-psf:pypi.downloads],
    DATE_ADD(CURRENT_TIMESTAMP(), -1, "year"),
    CURRENT_TIMESTAMP()
  )
WHERE
  file.project = 'pandas'
GROUP BY
  file.version,
  yyyymm
ORDER BY
  total_downloads DESC
LIMIT 1000

This time, the dataset contains one column pertaining to month, a second pertaining to the version of the Pandas and the third containing total downloads of a particular version of the library on a particular month. All in all, we ask for a full year's worth of download data i.e 12 rows, one for each month's cumulative.

Again, running the query, downloading the data and plotting it should give you something that looks like -



Actually. I lied. Directly plotting the dataset is going to look like a mess because there are a lot more versions than the 7 plotted above. I chose those 7 because they are the most recent versions. And, this too shows some interesting trends. It's interesting to note that Pandas 0.17.x and 0.18.x still have loyal/lazy users.

That's all for now. And I'm also being lazy by not posting the code-snippets I used to generate the plots from the data. But, I thought, this blog post would be a start so, I'll try to write a followup which links to the datasets and the code. Until then ...

Popular posts from this blog

Animation using GNUPlot

Animation using GNUPlotI've been trying to create an animation depicting a quasar spectrum moving across the 5 SDSS pass bands with respect to redshift. It is important to visualise what emission lines are moving in and out of bands to be able to understand the color-redshift plots and the changes in it.
I've tried doing this using the animate function in matplotlib, python but i wasn't able to make it work - meaning i worked on it for a couple of days and then i gave up, not having found solutions for my problems on the internet.
And then i came across this site, where the gunn-peterson trough and the lyman alpha forest have been depicted - in a beautiful manner. And this got me interested in using js and d3 to do the animations and make it dynamic - using sliders etc.
In the meanwhile, i thought i'd look up and see if there was a way to create animations in gnuplot and whoopdedoo, what do i find but nirvana!

In the image, you see 5 static curves and one dynam…

on MOOCs.

For those of you who don't know, MOOC stands for Massively Open Online Course.

The internet is an awesome thing. It's making education free for all. Well, mostly free. But it's surprising at the width and depth of courses being offered online. And it looks like they are also having an impact on students, especially those from universities that are not top ranked. Students in all parts of the world can now get a first class education experience, thanks to courses offered by Stanford, MIT, Caltech, etc.

I'm talking about MOOCs because one of my new year resolutions is to take online courses, atleast 2 per semester (6 months). And I've chosen the following two courses on edX - Analyzing Big Data with Microsoft R Server and Data Science Essentials for now. I looked at courses on Coursera but I couldn't find any which was worthy and free. There are a lot more MOOC providers out there but let's start here. And I feel like the two courses are relevant to where I …

On programmers.

I just watched this brilliant keynote today. It's a commentary on Programmers and the software development industry/ecosystem as a whole.



I am not going to give you a tl;dr version of the talk because it is a talk that I believe everyone should watch, that everyone should learn from. Instead, I am going to give my own parallel-ish views on programmers and programming.
As pointed out in the talk, there are mythical creatures in the software development industry who are revered as gods. Guido Van Rossum, the creator of Python, was given the title Benevolent Dictator For Life (BDFL). People flock around the creators of popular languages or libraries. They are god-like to most programmers and are treated like gods. By which, I mean to say, we assume they don't have flaws. That they are infallible. That they are perfect.
And alongside this belief in the infallibility of these Gods, we believe that they were born programmers. That programming is something that people are born wit…