dev.to is a wonderful blogging platform that emerged a few years ago. I love writing for it and reading content published there. But what I like the most, and I think what everybody likes the most, is the community that was built on the platform.
A community is known to interact a lot with the poster through different kinds of likes and comments. There is no “karma” on dev.to, but one way to measure the popularity, the score, of a post is by looking at the number of interactions this post had with the community:
the number of comments and, of course, the number of likes, which on the platform are divided into 3 categories: Unicorn 🦄, Like ❤ and Bookmark 📕.
I recently wondered if an article posted at a certain time of the day performed better than others. And, if so, what was the optimal time to publish a post in order to be read by as many people as possible. I had some intuition, but I wanted proof and facts to work with.
Here is what I did:
Gathering the data:
I will be short here as I’ll write a longer post in the future to explain in detail how to efficiently gather this type of data.
I recently noticed, looking at the DOM, that every article had a public id available.
I also knew that there is a public endpoint that allows you to fetch user information, which looks like this:
So naturally I tried to do the same with article and …
HTTP/1.1 200 OK
"body_html": "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body>\n<p>The other day I was touching up a PR that had been approved and was about to merge and deploy it when, out of habit, I checked the clock. It was 3:45pm, which for me, was past my \"merge before\" time of 3:30pm. I decided to hold off and wait until the next morning. </p>\n\n<p>The whole process got me thinking. Does anyone else have their own personal merge or deploy policies? Is there a time before or after when you say, not today? Is there a day of the week you don’t like to merge stuff. A lot of people joke about read-only Fridays, but I have to admit, I kinda follow that rule. Anything remotely high risk I wait until Monday to merge. </p>\n\n<p>What’s your personal merge/deploy policy?</p>\n\n</body></html>\n",
"description": "What’s your personal merge/deploy policy?",
"readable_publish_date": "Mar 22",
"title": "What’s your personal merge/deploy policy?",
"name": "Molly Struve",
and bingo !!
All I had to do now was: 1, find out whether article ids were sequential, and 2, if so, find the most recent article’s id.
Both things were easy to check. I just had to open my browser inspector a couple of times on recent articles.
What I did next was call this API 94k times using Scrapy and store the information in a clean .csv. More on this in a future post.
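The crawl itself is left for that future post. As a rough sketch of the "store it in a clean .csv" step only, assuming each successful response is a JSON object carrying `title`, `published_at` and `positive_reaction_count`, the payloads could be flattened like this. The helper names `article_to_row` and `write_rows`, the `article_id` column and the sample payload (including its reaction count) are made up for illustration and are not the code actually used here.

```python
import csv
import io
import json

# Columns kept for the analysis; "article_id" is the id used in the request.
FIELDS = ["article_id", "title", "published_at", "positive_reaction_count"]


def article_to_row(article_id, payload):
    """Flatten one JSON payload into a dict holding only the FIELDS columns."""
    row = {"article_id": article_id}
    for key in FIELDS[1:]:
        row[key] = payload.get(key)  # None when the field is missing
    return row


def write_rows(rows, out):
    """Write the rows as CSV, with a header line, to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Illustrative payload: the reaction count is invented for the example.
payload = json.loads(
    '{"title": "Example post", '
    '"published_at": "2019-03-22T22:19:36.651Z", '
    '"positive_reaction_count": 12}'
)

buffer = io.StringIO()
write_rows([article_to_row(1, payload)], buffer)
csv_text = buffer.getvalue()
```

Writing to a real file instead of `io.StringIO` only requires passing an object opened with `open("output.csv", "w", newline="")`.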
What do we have now ?
Out of 94k API calls, almost half of them returned a 404: resource not found. I guess it means that half of the articles created are never published but I am not sure about it. I still had ~40k data points, which was more than enough to prove my point.
Each row in my csv held several useful pieces of information, but for what I was looking for I only needed two things: the number of likes and the date of publishing.
Fortunately, those two things were returned by the API, see positive_reaction_count and published_at in the previous snippet.
To work with the data I used pandas, a well-known Python library, and one of the most popular Python packages on GitHub.
I’ll show some code snippets here. If you want a more thorough tutorial, please tell me in the comments.
Loading data from csv with pandas is very easy:
import pandas as pd
df = pd.read_csv('./output.csv')
As I wanted to know the best time/day to post on dev.to, I needed to transform the published_at column into 2 other columns: day_of_week (‘Mon’, ‘Tue’, …) and hour.
Pandas makes it easy to add, transform and manipulate data. All I needed were these few lines:
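df['hour'] = pd.to_datetime(df['published_at']).dt.hour
days_arr = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
def get_day_of_week(x):
    # weekday() returns 0 for Monday through 6 for Sunday
    date = pd.to_datetime(x)
    return days_arr[date.weekday()]
df['day_of_week'] = df['published_at'].apply(get_day_of_week)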
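As a side note, the same two columns can probably be derived without a row-by-row `apply`, using the vectorized `.dt` accessor. This is only a sketch on two sample timestamps in the same format as `published_at`; the second timestamp is made up for the example.

```python
import pandas as pd

# Two sample timestamps in the same ISO 8601 format as `published_at`.
df = pd.DataFrame(
    {"published_at": ["2019-03-22T22:19:36.651Z", "2019-03-25T09:05:00.000Z"]}
)

# Parse the whole column once, as timezone-aware UTC datetimes.
published = pd.to_datetime(df["published_at"], utc=True)

df["hour"] = published.dt.hour
# day_name() gives "Friday", "Monday", ...; keep the first three letters
# to match the 'Mon', 'Tue', ... labels used in this post.
df["day_of_week"] = published.dt.day_name().str[:3]
```

On a few tens of thousands of rows the difference is mostly about readability, since both approaches finish quickly.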
All my data is now stored in a dataframe, the main data structure used by pandas, hence the name: df.
A little bit of data viz
I now had all the information I needed.
Here is what was in my dataframe:
Each line representing one post, I had around 38k lines.
What I naturally did next was summing positive_reaction_count by day and hour.
Here is how to do it in pandas:
aggregated_df = df.groupby(['day_of_week', 'hour'])['positive_reaction_count'].sum()
And now my df looked like this:
Great. To get the data in exactly the format I need, a little more work is necessary.
Basically rotating columns around.
pivoted_df = aggregated_df.reset_index().pivot(index='hour', columns='day_of_week', values='positive_reaction_count')
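To make the two steps above concrete, here is a toy run of the same group-then-pivot idea on four made-up posts (the numbers are invented purely for illustration):

```python
import pandas as pd

# Four made-up posts: three on Monday, one on Tuesday.
toy = pd.DataFrame(
    {
        "day_of_week": ["Mon", "Mon", "Mon", "Tue"],
        "hour": [9, 9, 15, 9],
        "positive_reaction_count": [10, 30, 5, 7],
    }
)

# Long format: one total per (day, hour) pair.
summed = toy.groupby(["day_of_week", "hour"])["positive_reaction_count"].sum()

# Wide format: hours as rows, days as columns, ready for a heatmap.
wide = summed.reset_index().pivot(
    index="hour", columns="day_of_week", values="positive_reaction_count"
)
```

Slots with no post at all, such as Tuesday at 15h here, end up as NaN in the wide table.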
Now my df has this look:
And now, finally, I can use the seaborn package to display a nice heatmap.
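import seaborn as sns
# Assumption: pivoted_sorted is pivoted_df with its columns reordered Mon..Sun
pivoted_sorted = pivoted_df[days_arr]
sns.heatmap(pivoted_sorted, cmap="coolwarm")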
And here is what I got:
I find this heatmap very simple and self-explanatory. Two regions stand out: the red one, bottom left, and the dark blue one, top right.
But first, because we are talking about times, we need to know what timezone we are talking about.
If you look carefully at "published_at": "2019-03-22T22:19:36.651Z", you will notice a Z at the end of the time string.
Well, this Z indicates that the time string is expressed in UTC, that is, with a zero offset.
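As a quick check, pandas understands the trailing Z on its own, and the timestamp can then be shifted to another zone. The sketch below assumes "America/New_York" as a stand-in for the US East Coast.

```python
import pandas as pd

# The trailing "Z" makes pandas return a timezone-aware UTC Timestamp.
ts = pd.to_datetime("2019-03-22T22:19:36.651Z")

# Same instant, expressed in US East Coast time.
# "America/New_York" is an assumption chosen for this example.
eastern = ts.tz_convert("America/New_York")

utc_hour = ts.hour
eastern_hour = eastern.hour
```

Keeping everything in UTC for the analysis and converting only when reading the heatmap avoids mixing zones in the same column.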
So, going back to our heatmap, we notice that Monday to Wednesday afternoon (Monday to Wednesday morning for people on the US East Coast) is the most active zone on the map.
And Saturday and Sunday are two very calm days, especially from midnight to noon.
So, at first sight, you could think that you had better post at those times to maximize your chances of getting many likes. Well, we need to step back a little.
What this heatmap shows is which publication slots collected the most likes in total. It does not take into account the fact that more posts automatically mean more likes.
So maybe (right now we can’t know for sure) the red zone on the heatmap just means that we observe more likes on the platform only because more articles are posted during those times.
This difference is critical, because what we are trying to find is the best time to post in order to maximize our likes, and this map can’t help us with that.
So what we need is to make the same kind of map, but instead of summing the likes for each hour of each day, we have to compute the mean number of likes per post.
We could also compute the median, I did it, and there is not much difference 🙂.
Thanks to pandas, we only need to change one small thing in our code:
# sum -> mean
aggregated_df = df.groupby(['day_of_week', 'hour'])['positive_reaction_count'].mean()
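To see why the switch from sum to mean matters, here is a tiny made-up example with a "busy" slot holding four average posts and a "quiet" slot holding a single, better-performing one. The slot names and the numbers are invented for illustration.

```python
import pandas as pd

# Made-up data: many ordinary posts in one slot, one strong post in another.
toy = pd.DataFrame(
    {
        "slot": ["busy", "busy", "busy", "busy", "quiet"],
        "positive_reaction_count": [10, 10, 10, 10, 25],
    }
)

# Compare post count, total likes, mean and median likes per slot.
per_slot = toy.groupby("slot")["positive_reaction_count"].agg(
    ["count", "sum", "mean", "median"]
)
```

The busy slot wins on the total (40 against 25) simply because it holds more posts, while the quiet slot wins on the per-post mean (25 against 10), which is the number that matters to a single author.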
And here is the new heatmap:
As you can see, this heatmap is very different and much more useful than the previous one.
We now observe stripe patterns. There is a wide blue one spanning from Monday to Sunday, from 4 a.m. to 10 a.m.
We also observe a peak of activity during the UTC afternoon.
What we can now state following this heatmap is that
articles posted during the afternoon had, on average, 10 to 20 more positive interactions than the ones posted very early in the UTC day.
I think it is all about the reader/writer ratio. What those two heatmaps show is that even though there are far fewer readers during the weekend, there are also proportionally fewer writers. This is why an article published during the weekend gets about the same number of interactions as an article published during the week.
Thank you for reading:
I hope you liked this post.
This series is far from over. I have plenty more information to show you related to this dataset.
Please tell me in the comments if you want a particular aspect of dev.to data analyzed, and don’t forget to subscribe to my newsletter, there is more to come (and you’ll also get the first chapters of my next ebook for free 😎).
If you want to continue reading about some Python tips, go there, you might like it :).
If you like JS, I’ve published something you might like.
And if you prefer git, I got you covered.