
3 Projects I need help with

I've wanted to learn methods to extract and visualize data, and I've come up with 3 projects that I could use to learn them! I've been working on them for a couple of months now, doing the background work to get the trivial problems out of the way!

Mining my blog, mining my bookmarks and mining my Facebook data - these are the three projects I have in mind and need help with.


Mining my blog


In case you don't know, I have another blog which automatically posts the videos I add to my YouTube Favorites list, using IFTTT. I've been using this service for over a year now and I have ~600 posts on that blog, all of them videos. Now, I want to mine the blog to look at how frequently I watch YouTube videos, and maybe see a pattern which I can correlate with my college exam schedule!


While looking for ways to mine the blog, I found out that I can download my blog as an XML file! So, the getting-the-data part is taken care of. Here's the XML file. Download it and open it in a browser and you'll see that each post comes with a date and time stamp! Now open the XML file in any editor and look for this specific date and time stamp! I haven't learnt XML parsing yet (I hope to do it in Python), but as I understand it, I can extract these stamps! Once the extraction is done, it's just a matter of plotting a histogram!
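A minimal sketch of that extraction step, using only Python's standard library. It assumes the export is an Atom feed in which every post is an `<entry>` with a `<published>` stamp; that layout is an assumption, so check the element names against the actual file. The function names and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import datetime

# Assumption: the export is an Atom feed, so elements live in this namespace.
ATOM = "{http://www.w3.org/2005/Atom}"

def post_timestamps(root):
    """Return one datetime per <entry><published> element under root."""
    stamps = []
    for entry in root.iter(ATOM + "entry"):
        published = entry.findtext(ATOM + "published")
        if published:
            # Older Pythons do not accept a trailing "Z", so normalise it.
            text = published.strip().replace("Z", "+00:00")
            stamps.append(datetime.fromisoformat(text))
    return stamps

def posts_per_month(stamps):
    """Count posts per calendar month, keyed as 'YYYY-MM'."""
    return Counter(s.strftime("%Y-%m") for s in stamps)

# Made-up sample standing in for the real export.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><published>2013-01-05T10:15:00.001+05:30</published></entry>
  <entry><published>2013-01-20T22:40:00+05:30</published></entry>
  <entry><published>2013-02-03T09:00:00+05:30</published></entry>
</feed>"""

if __name__ == "__main__":
    # For the real file: root = ET.parse("blog.xml").getroot()
    counts = posts_per_month(post_timestamps(ET.fromstring(SAMPLE)))
    for month, n in sorted(counts.items()):
        print(month, "#" * n)  # crude text histogram
```

The per-month counts can be handed to any plotting library to get the actual histogram; grouping by week or by hour of day only needs a different `strftime` format.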


Mining my bookmarks


I use bookmarks. A lot! I even use Xmarks to back all of them up online and to share them across browsers and operating systems. Now, something bad happened - probably because I use Chrome, and Chrome has Chrome Sync, which syncs bookmarks as well - and I now have multiple copies of the same bookmark. You can see symptoms of this here and the full-blown tumor that it has now become here! (my current set of bookmarks, saved as a .html)


Now, the first thing to do is to remove the multiple copies of bookmarks. There is no trivial solution for this, as far as I've looked! So, any suggestions or solutions are welcome here!
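One possible approach, sketched with Python's standard library: read the exported .html, collect every `<A HREF=...>` link, and keep only the first copy of each URL. This assumes the export uses the usual `<DT><A HREF="..." ADD_DATE="...">Title</A>` layout; the class and function names are made up for illustration, and the sketch ignores the folder structure.

```python
from html.parser import HTMLParser

class BookmarkParser(HTMLParser):
    """Collect every <a> link as a dict with href, add_date and title."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._current = None  # the bookmark whose title is being read

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            attrs = dict(attrs)
            self._current = {
                "href": attrs.get("href", ""),
                "add_date": attrs.get("add_date"),
                "title": "",
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.bookmarks.append(self._current)
            self._current = None

def unique_bookmarks(html_text):
    """Return bookmarks in file order, keeping the first copy of each URL."""
    parser = BookmarkParser()
    parser.feed(html_text)
    parser.close()
    seen = {}
    for bookmark in parser.bookmarks:
        seen.setdefault(bookmark["href"], bookmark)
    return list(seen.values())

# Made-up sample standing in for the real bookmarks file.
SAMPLE = """<DL><p>
<DT><A HREF="http://example.com/" ADD_DATE="1370996258">Example</A>
<DT><A HREF="http://example.org/" ADD_DATE="1370996300">Other</A>
<DT><A HREF="http://example.com/" ADD_DATE="1371000000">Example</A>
</DL><p>"""

if __name__ == "__main__":
    for bookmark in unique_bookmarks(SAMPLE):
        print(bookmark["href"], bookmark["title"])
```

Matching on the exact URL is the simplest rule; copies that differ only by a trailing slash or by `http` vs `https` would need the URL normalised first.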


Ohh, as a perk, once I remove the multiple copies, I am going to mine this list as well!


If you look at the source of the bookmarks page, i.e. the .html file, you will notice that each bookmark has an attribute - ADD_DATE="1370996258" - which I'm guessing is the date on which I created the bookmark! I'm now trying to decipher the number to know the date (and maybe the time)!


To decipher the number, I've created a folder called example and I've been creating bookmarks in it at regular intervals - roughly once every 2 hours over a weekend! I'm in the process of deciphering this attribute, so if you can help me remove the extra copies, I'll have something interesting in hand!
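A hint for the deciphering: browser bookmark exports of this kind normally store ADD_DATE as a Unix timestamp, i.e. the number of seconds since 1970-01-01 00:00:00 UTC. Assuming that holds for this file, a short sketch of the conversion (the function name is made up for illustration):

```python
from datetime import datetime, timezone

def decode_add_date(value):
    """Interpret an ADD_DATE value as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)

if __name__ == "__main__":
    print(decode_add_date("1370996258").isoformat())
```

For the value quoted above this gives 2013-06-12T00:17:38+00:00. The bookmarks created about 2 hours apart in the example folder should then show ADD_DATE values that differ by roughly 7200, which is an easy way to confirm the interpretation.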


Mining my Facebook Data


This is kinda like what Wolfram Facebook Analytics does, actually a very dumbed-down version of it! In case you don't know, you can download all of the data Facebook has on you - information, friend list, photos, videos and what not - from here. Facebook used to provide an HTML file which contained information on the posts you've made, the # of likes on each post, the # of comments and so on, but they don't do it anymore!


But anyway, I have copies of the data from Nov 2011 and Dec 2012. These are HTML files corresponding to my wall as of Nov 2011 and Dec 2012. Now, again, if you turn your attention to the source of the wall, i.e. the .html file, you will see a pattern - a pattern as to where my posts are, and where the comments and likes for a certain post are!


Upon inspection, you'll see that

<abbr class="time published" title="2012-12-14T23:09:08+0000">December 15, 2012 at 4:39 am</abbr> is the HTML tag holding the time stamp of a post. Clearly, it has the date and the time of the post.

Further, <div class="feedentry hentry"...>...</div> is the tag for a post,

<div class="comment hentry"...>...</div> is the tag for a comment pertaining to the post and
<div class="comments hfeed"...>...</div> is the tag for the number of people who liked the post!

So, again, with the trivial things cleared, I now need to learn how to parse HTML and XML files (using Python) and how to go about extracting data from these specific tags!
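A rough sketch of what that extraction could look like with Python's built-in `html.parser`. It pulls the time stamp out of every `<abbr class="time published">` tag and counts the `<div>` tags by class. The class names are the ones quoted above; the way the tags are nested in the sample is a guess, and the parser class name is made up for illustration.

```python
from collections import Counter
from datetime import datetime
from html.parser import HTMLParser

class WallParser(HTMLParser):
    """Collect post time stamps and count the div classes of interest."""

    def __init__(self):
        super().__init__()
        self.timestamps = []
        self.div_counts = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "abbr" and "published" in classes and attrs.get("title"):
            # The title looks like 2012-12-14T23:09:08+0000
            stamp = datetime.strptime(attrs["title"], "%Y-%m-%dT%H:%M:%S%z")
            self.timestamps.append(stamp)
        elif tag == "div":
            for name in ("feedentry", "comment", "comments"):
                if name in classes:
                    self.div_counts[name] += 1

# Made-up sample; the real nesting of these tags may differ.
SAMPLE = """
<div class="feedentry hentry">
  <abbr class="time published" title="2012-12-14T23:09:08+0000">December 15, 2012 at 4:39 am</abbr>
  <div class="comments hfeed">
    <div class="comment hentry">Nice!</div>
  </div>
</div>
"""

if __name__ == "__main__":
    parser = WallParser()
    parser.feed(SAMPLE)
    parser.close()
    print(parser.timestamps)
    print(dict(parser.div_counts))
```

The time stamps come out in UTC (the `+0000` in the title), so they need converting to local time before looking at, say, the hour of day of each post.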


I can then plot a time series of my posts: their frequency and the average # of likes.

Going further, I could look for a correlation between the words in a post and the # of likes for the post. Well, you get it, the possibilities to screw around with this data are just limitless!

So, there you go. 3 projects which I hope to finish by the end of 2013! But hey, if you're interested, you're welcome to work on these files! Or you can do the same with your own data!


Any comments on how to go about doing these projects, or actual solutions, are very welcome and highly appreciated!
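As a first, very crude take on the words-vs-likes idea, one could compute the average # of likes over all posts containing a given word. This is not a proper correlation measure, just a quick comparison. The sketch assumes the posts have already been extracted into (text, likes) pairs; that data layout, the function name and the sample posts are all made up for illustration.

```python
import re
from collections import defaultdict

def mean_likes_per_word(posts, min_posts=2):
    """Average likes of the posts containing each word.

    posts is an iterable of (text, likes) pairs. Words that appear in
    fewer than min_posts posts are dropped, since one post says little.
    """
    totals = defaultdict(lambda: [0, 0])  # word -> [sum of likes, post count]
    for text, likes in posts:
        # set() so that a word repeated within one post is counted once
        for word in set(re.findall(r"[a-z']+", text.lower())):
            totals[word][0] += likes
            totals[word][1] += 1
    return {w: s / n for w, (s, n) in totals.items() if n >= min_posts}

if __name__ == "__main__":
    sample_posts = [
        ("Exam tomorrow", 2),
        ("Exam over, party!", 10),
        ("Party tonight", 6),
    ]
    print(mean_likes_per_word(sample_posts))
```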


Happy Brainstorming! :) 
