Stack Overflow dev survey - Cleaning up numbers
If you haven't read my last post, I am taking a look at Stack Overflow's yearly developer survey. After the survey is complete, they release the collected data publicly, which can be found here.
At first, I was only looking at numbers from the 2017 survey, but there were a few questions I wanted to ask using survey numbers from 2011-2017, such as the ratio of male to female developers among those who responded to the survey, and the change in developer salaries over the last 7 years.
That is when I hit a roadblock - the actual survey data. Real-life data is a mess, from the weird formats it is saved in to the weird values saved in them. For example, the developer's country/region has been a consistent question over the years, but the possible answers have changed, e.g. from "United States of America" to "United States". The column name for this data has changed as well, from "country" to "Country" to "What country or region do you live in?" etc.
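To make the country example concrete, here is a minimal sketch of how such drift could be smoothed over with Pandas. It is not the actual cleanup script: the alias dictionaries, the `normalize_country` helper, the `Year` column and the toy rows are all made up for illustration, and only the column names and the two "United States" spellings come from the survey files described above.

```python
import pandas as pd

# Hypothetical alias tables: map each year's spelling to one canonical form.
COUNTRY_COLUMN_ALIASES = {
    "country": "Country",
    "What country or region do you live in?": "Country",
}
COUNTRY_VALUE_ALIASES = {
    "United States of America": "United States",
}


def normalize_country(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with one 'Country' column and canonical country names."""
    # rename() ignores aliases that are not present, so the same call
    # works for every year's file.
    df = df.rename(columns=COUNTRY_COLUMN_ALIASES)
    # Trim stray whitespace, then swap old spellings for the canonical one.
    df["Country"] = df["Country"].str.strip().replace(COUNTRY_VALUE_ALIASES)
    return df


# Toy stand-ins for two survey years (invented rows, not real survey data).
survey_a = pd.DataFrame(
    {"What country or region do you live in?": ["United States of America", "India"]}
)
survey_b = pd.DataFrame({"country": ["United States", " Germany "]})

combined = pd.concat(
    [
        normalize_country(survey_a).assign(Year=2011),
        normalize_country(survey_b).assign(Year=2017),
    ],
    ignore_index=True,
)
print(combined)
```

Keeping the aliases in plain dictionaries means a newly discovered spelling only needs one extra entry, without touching the function itself.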
This is why data analysts in the real world spend a lot more time cleaning and massaging the data into the right form/format before actually getting to do any analysis. And this is exactly what I ended up doing this weekend.
The Python scripts I wrote to clean up the datasets can be found on GitHub. As you can see, I use the Pandas library to load and clean up each dataset, and finally save the cleaned version. I also got a chance to get a broad overview of the kinds of questions the survey can answer.
A preliminary look at the datasets can be found in this Jupyter notebook. I briefly and superficially look at the distribution of devs across the world, what their salaries are and what their gender is, among those devs who responded to the survey.
Tell me if you have any interesting questions that can be answered using these datasets. Until next time ...
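The load, clean and save flow looks roughly like the sketch below. This is an assumed outline rather than the code from the repository: `clean_survey` is a hypothetical helper, the cleaning steps are just two plausible ones, and in-memory buffers with invented rows stand in for the real CSV files.

```python
import io

import pandas as pd


def clean_survey(raw_csv, out_csv) -> pd.DataFrame:
    """Load one survey CSV, apply basic cleanup, and write the result."""
    df = pd.read_csv(raw_csv)
    # Strip stray whitespace from the header names.
    df.columns = [name.strip() for name in df.columns]
    # Drop rows where every answer is missing.
    df = df.dropna(how="all")
    df.to_csv(out_csv, index=False)
    return df


# In-memory stand-ins for the input and output files (toy data).
raw = io.StringIO("Country ,Salary\nIndia,10\n,\nCanada,20\n")
cleaned_buffer = io.StringIO()
cleaned = clean_survey(raw, cleaned_buffer)
print(cleaned_buffer.getvalue())
```

With real files, the two buffers would simply be replaced by file paths, since `read_csv` and `to_csv` accept either.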