
Math/statistics question: FB Likes gained and lost

Oystein

Penultimate Amazing
Joined
Dec 9, 2009
Messages
18,903
I am following a Facebook page and monitor the development of the number of "Likes" it has.

For a while now, I have written down the number of likes every day around the time of my second coffee in the morning (so sample frequency wasn't exactly constant). I found that the number increases by between 20 and 80 on most days.

However, this is a net gain: They get more than 20-80 new "likes" per day, but some users no doubt "unlike" them, or delete their account.

When I sample from hour to hour, I see that sometimes they have a net loss.

When I sample every 5 minutes, I get an even clearer picture.

What I want to figure out is: Is it possible, and if so how, to estimate from such sampling how many total new "likes" and how many "unlikes" there are per time interval?

To give you an idea, here are the numbers for the last 3 hours, sampled at pretty constant 5-minute intervals:
minutes|likes|gain|loss
0|264067||
5|264067||
10|264067||
15|264067||
20|264067||
25|264067||
30|264067||
35|264066||-1
40|264063||-3
45|264064|1|
50|264063||-1
55|264062||-1
60|264062||
65|264064|2|
70|264063||-1
75|264063||
80|264064|1|
85|264064||
90|264064||
95|264064||
100|264063||-1
105|264063||
110|264063||
115|264063||
120|264063||
125|264063||
130|264063||
135|264063||
140|264063||
145|264064|1|
150|264064||
155|264064||
160|264064||
165|264064||
170|264065|1|
175|264065||
180|264065||
185|264065||
190|264065||
195|264065||
200|264065||
205|264066|1|

In total, only 1 "like" was lost, but I observed 7 gained and 8 lost in the meantime.

In many intervals, the number doesn't change. This will often mean that no "like" was gained or lost, but it could sometimes mean that 1 was gained and 1 lost, or +2-2, etc.
So the observed net loss of -1 could really be +7-8, or +8-9, or +9-10, etc.

Assuming that likes and unlikes arrive randomly (should be normally distributed), but realizing that rates may change during the day and from day to day, is there a way to use such sampling to figure out how many likes and unlikes there really are?
 
If you had a big enough sample you could, but currently your margin of error is huge. For example, is your 3-hour sample typical? What would happen to your results if one person acted differently in that 3-hour sample?
 

Intuitively, I'd say that the frequency at which I sample is more important than number of data tuples.

Let's say that, on average, a like is added or subtracted every 3 minutes.
If I sample at a very high rate, say every 1/100th of a second, I am almost guaranteed to catch every single change, but that would be very wasteful. I'd be almost as close if I sampled every 1 s. I'll still get a pretty good feel if I sample every 1 min, but if I sample only once a day, I think it will be almost impossible to estimate the true frequency with any confidence at all; in that case, even a million tuples will be inconclusive (especially since the true frequency varies).

What I am driving at is:
a) Is there an algorithm to estimate the average frequency of an event by sampling at a constant frequency?
b) If yes, what's an efficient frequency? "Efficient" would mean the frequency that gives me the highest confidence given a fixed number of samples. Or that minimizes the number of samples to achieve a desired level of confidence?

The problem with determining the best sample frequency obviously is that it will likely be some proportion or multiple of the true frequency, which is, yes, unknown.
 

I'm not sure how useful this is, but here are some general facts about discrete sampling of signals:
1. Generally, to uniquely detect a signal of frequency f, you need to sample at a rate of at least 2f (the Shannon sampling theorem).
2. In order to precisely pin down a frequency, your sample must be large (i.e. go on for a long time). Particularly, if you double your sampling frequency, you must make your sample twice as large (in number of elements, not time) for the same amount of accuracy (because then you're covering twice as many frequencies).
3. The accuracy of your frequency analysis is, under decent conditions, given by the quotient of the sampling frequency and sample length (i.e. if you take 5 samples per second, and your sample is 10 samples long, you will have an accuracy of about half a hertz, or two elements per hertz).

Ideally, then, you want to sample at about twice the highest frequency you expect to matter (and not much more than that), and you want to sample for as long as possible.

Basically, looking at your data, you will need a much larger sample, and you are probably better off sampling a bit less frequently. Currently I can't see any way of separating the signal from the noise in there; it's simply not possible to discern any kind of periodicity. With a good enough sample, I think you could do two Fourier transforms and figure out the frequencies and amplitudes of the signals that amount to the "likes" and "dislikes".
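
For illustration, here is a rough sketch of what that step could look like in Python (my own sketch, just numpy applied to the net changes from your table; with this little data the spectrum won't reveal much, but it shows the mechanics and the frequency resolution I mentioned):

```python
# Rough illustration (my own sketch, not a recipe from this thread): a periodogram of
# the 5-minute net changes from the table above, to look for periodic structure.
import numpy as np

# Net change per 5-minute interval, copied from the table in the opening post.
net = np.array([0, 0, 0, 0, 0, 0, -1, -3, 1, -1, -1, 0, 2, -1, 0, 1, 0, 0, 0, -1,
                0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
               dtype=float)

dt = 5.0                      # sampling interval in minutes
fs = 1.0 / dt                 # sampling frequency in samples per minute

# Discrete Fourier transform of the mean-removed series.
spectrum = np.fft.rfft(net - net.mean())
freqs = np.fft.rfftfreq(len(net), d=dt)   # frequencies in cycles per minute
power = np.abs(spectrum) ** 2

# Frequency resolution = sampling frequency / number of samples = 1 / total time.
print("frequency resolution: %.5f cycles per minute" % (fs / len(net)))
for f, p in zip(freqs, power):
    print("%.4f cycles/min   power %6.2f" % (f, p))
```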
 
Thanks, TubbaBlubba, that looks like a lead (I still have to get my head around it, though).

Perhaps I use the term "frequency" incorrectly for the occurrence of +1 and -1 events? I strongly assume these are truly random, in the sense that no two Facebook users will ever like or unlike in any time-coordinated fashion. A random process would produce nothing but noise, wouldn't it? (Is that noise white?)

Since my events are discrete and binary, I have a Bernoulli process, right?

And this Bernoulli process, by adding discrete +1 and -1 events, produces a random walk. Since the probabilities of +1 and -1 are not equal, I have, more specifically, a Markov chain, right?

Now one more level of complication: the steps of the random walk occur after time intervals that have a normal distribution (really?).



Ok I lost focus long ago :D Where was I? Ah yes, questioning whether events occurring after random time intervals produce a random walk that can be said to have a frequency. Does average rate = frequency?
 
Thanks, TubbaBlubba, that looks like a lead (I still have to get my head around it, though).

Perhaps I use the term "frequency" incorrectly for the occurrence of +1 and -1 events? I strongly assume these are truly random, in the sense that no two Facebook users will ever like or unlike in any time-coordinated fashion. A random process would produce nothing but noise, wouldn't it? (Is that noise white?)
I definitely think there are going to be patterns to the likes and unlikes. You're going to have times of the day with peaks, correlations with content publication, time zones, etc. But yes, it will be a very noisy process; it's possible that the "noise" component will be white, but not necessarily.

Since my events are discrete and binary, I have a Bernoulli process, right?
And this Bernoulli process, by adding discrete +1 and -1 events, produces a random walk.
Considering only cases of +1/-1/0, you're going to have four possible cases: 0 likes, 0 dislikes; 1 like, 0 dislikes; 0 likes, 1 dislike; 1 like, 1 dislike. The first and last are indistinguishable, but rarely occur if you sample frequently enough, so you may want to consider the two processes separately.

Since the probabilities of +1 and -1 are not equal, I have, more specifically, a Markov chain, right?

Now one more level of complication: the steps of the random walk occur after time intervals that have a normal distribution (really?).
The Bernoulli process is a specific case of a Markov Chain, not the other way around. In this case, we have the additional complication that past and future events are not independent, because Facebook favours pages that have seen recent activity, but hopefully we can disregard this. A normal distribution seems plausible; in that case you should be able to determine its parameters.

Ok I lost focus long ago :D Where was I? Ah yes, questioning whether events occurring after random time intervals produce a random walk that can be said to have a frequency. Does average rate = frequency?
No, the notion of frequency presumes some manner of periodicity (as I discussed above). However, if a noisy signal contains periodic elements, you can extract and analyse those elements using Fourier analysis; essentially, you'll be able to tell if there are large-scale patterns in the signal, and how much of it is noise.

So, it really depends on exactly what you want to find out. You can calculate the mean time between likes with probability distributions, you can find out if there are recurring patterns with a Fourier analysis, you can just add them all up and divide by the total time for an average rate, and so on.
 
[Note: I am quoting in reverse order, and not everything]
No, the notion of frequency presumes some manner of periodicity (as I discussed above). However, if a noisy signal contains periodic elements, you can extract and analyse those elements using Fourier analysis; essentially, you'll be able to tell if there are large-scale patterns in the signal, and how much of it is noise.

So, it really depends on exactly what you want to find out. You can calculate the mean time between likes with probability distributions, you can find out if there are recurring patterns with a Fourier analysis, you can just add them all up and divide by the total time for an average rate, and so on.
Aha!
Important point!

What I want to find out is: At which average rate do likes come in, and at which rate do likes go out?
In the end I want to be able to say something like:
"In June 2015, the number of Likes increased by 30/day on average, but that net gain was actually the sum of, on average, 180 likes and 150 unlikes".
The net gain of 30 is easy to figure out: I sample once every day, at the same time, compute #(today) - #(yesterday), and then take the average for the month.
What I am interested in now is whether there is a ton of fluctuation within each day, or very little.

I realize that the rate changes non-randomly, as you say, with time of day, with the publishing of posts etc., but I am not much interested in that so far. If I find I have a method to get a clear picture of the true dynamics, I might start to get interested, for example, in the impact of certain posts.

...
Considering only cases of +1/-1/0, you're going to have four possible cases: 0 likes, 0 dislikes; 1 like, 0 dislikes; 0 likes, 1 dislike; 1 like, 1 dislike. The first and last are indistinguishable, but rarely occur if you sample frequently enough, so you may want to consider the two processes separately.
...
Yes on the highlighted, but there's the rub: I DO want to distinguish them; or know how often I need to sample to make cases of "+1-1" rare "enough" to ignore.
I have a hunch that the distribution of net gains (... -3, -2, -1, 0, +1, +2, +3 ...) in a data series sampled at constant frequency should give me clues about how likely it is that "0" observations are actual "0", or "+1-1", or "+2-2", etc. So if I find that 30% of the time I get a "0" observation, and some smart statistics argument tells me that 20 of those 30 percentage points are actual "0", 8 are "+1-1", and 1.8 are "+2-2", then I can use that information.
(Similarly, "+1" observations can be true "+1", or "+2-1", "+3-2" etc.)

Am I explaining this well enough?
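
To make that hunch a bit more concrete, here is a rough sketch of the kind of argument I have in mind. It rests on an assumption not stated anywhere in this thread: that likes and unlikes arrive as two independent Poisson streams, in which case the net change per interval follows a Skellam distribution, and the two underlying rates can be fitted by maximum likelihood from the observed net changes alone. The data are just the net changes from my OP table; the variable names are made up:

```python
# Sketch under the assumption that likes and unlikes are independent Poisson streams,
# so the net change per 5-minute interval is Skellam-distributed. Fitting the two rates
# by maximum likelihood then says how many "+1-1" cancellations hide among the zeros.
import math
import numpy as np
from scipy.optimize import minimize
from scipy.stats import skellam

# Net change per 5-minute interval, copied from the table in my opening post.
net = np.array([0, 0, 0, 0, 0, 0, -1, -3, 1, -1, -1, 0, 2, -1, 0, 1, 0, 0, 0, -1,
                0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1])

def neg_log_likelihood(log_rates):
    lam_in, lam_out = np.exp(log_rates)              # exp() keeps both rates positive
    return -skellam.logpmf(net, lam_in, lam_out).sum()

fit = minimize(neg_log_likelihood, x0=np.log([0.5, 0.5]), method="Nelder-Mead")
lam_in, lam_out = np.exp(fit.x)                      # likes / unlikes per 5 minutes

print("estimated likes per hour:   %.2f" % (lam_in * 12))
print("estimated unlikes per hour: %.2f" % (lam_out * 12))

# Estimated share of intervals that show 0 even though a like and an unlike cancelled:
p_cancel = sum(math.exp(-lam_in - lam_out) * (lam_in * lam_out) ** k / math.factorial(k) ** 2
               for k in range(1, 6))
print("P(net 0 but something happened): %.3f" % p_cancel)
```

Of course this only makes sense to the extent that the two rates are roughly constant over the sampled stretch; over a whole day, with the rates drifting, one would have to fit shorter windows separately.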


It seems that 5-minute sampling intervals aren't too far from optimal. I am currently sampling manually, just to get a feel: I have an alarm set to snooze for 5 minutes; when it goes off, I push it back to snooze, open the Firefox tab with that FB page, hit "F5", and copy the time and # of "Likes" into a spreadsheet. Eventually I might look at automatizicationing this. Yes, that wasn't a word. My brain hurts already, can't find a real word :boxedin:
 
Then, the essentials of the sampling theorem still apply: You will have to increase your sampling rate until you are satisfied that there are no signals that pass under your radar. If you hypothesize that a significant portion of your 0s are actually composed of a high-frequency signal (a regular pattern of +1/-1), the only way to check is to increase your sampling frequency.

Any other line of reasoning will essentially be based on assumptions of reasonable frequencies of those signals. Is it likely that there are two very closely matched processes resulting in, say, a like and a dislike every minute where they only end up mismatching by 2-3 likes in an hour? Not really. But you can always recreate any low-frequency pattern with a phase-shifted high frequency signal (again, the Sampling theorem), so the only way to be sure is to sample at a rate where you're fairly certain you're not missing anything (or to compare spectral analyses at two different sampling rates, but that's going to be hard in this case.)
 
Ok, I am beginning to accept that I can't decrease sample frequency much and still hope for useful data to figure out what I want to figure out.

In the meantime, during the last 2 days, I have, at different times of day, taken a total of 116 five-minute samples: once 36 or so in a row, sometimes 5 in a row, or 12 in a row.
I think it doesn't really matter that the samples are not all consecutive; it's more important that they are as close to 5 minutes each as I can get them, and as numerous as possible.

Of the 116 samples, the net gains/losses are distributed thusly:
+/-|# of samples
-3|1
-2|1
-1|19
0|75
+1|17
+2|3
+3|0
The gains add up to +23, the losses to -24, for an overall net change of -1 during a total time of 116*5 = 580 minutes (9:40 hours).
In 5 of these samples (4.3%), I have more than 1 loss or gain, but only one of them is as large as -3.
As a first approximation, I'd venture to guess that there may also have been around 5 samples where a +1 and a -1 canceled out.
So I estimate that I have about 29 individual unlikes per 580 minutes = 3 per hour, or roughly 70 per day if that rate held around the clock.
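
As a sanity check on that back-of-the-envelope guess, here is a small sketch that again assumes likes and unlikes per interval are independent Poisson counts (my assumption, not something established in this thread). In that case the mean of the net change estimates the difference of the two rates and its variance estimates their sum, so both rates fall out of the sample mean and variance:

```python
# Method-of-moments sketch under the independent-Poisson assumption: for the net change
# X per 5-minute interval, E[X] = lam_in - lam_out and Var[X] = lam_in + lam_out.
import numpy as np

# Distribution of net changes over the 116 five-minute samples (from the table above).
values = np.array([-3, -2, -1, 0, 1, 2])
counts = np.array([ 1,  1, 19, 75, 17, 3])

samples = np.repeat(values, counts)
mean, var = samples.mean(), samples.var()

lam_in = (var + mean) / 2.0      # estimated likes per 5 minutes
lam_out = (var - mean) / 2.0     # estimated unlikes per 5 minutes

print("likes:   %.2f per 5 min = %.1f per hour" % (lam_in, lam_in * 12))
print("unlikes: %.2f per 5 min = %.1f per hour" % (lam_out, lam_out * 12))
```

For these counts that comes out at roughly 3 likes and 3 unlikes per hour, i.e. in the same ballpark as the guess above.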


Ok, thanks for the food for thought, TubbaBlubba, I suppose there is no good reason for me to try to get much more mathy and precise.
 
Assuming that likes and unlikes arrive randomly (should be normally distributed), but realizing that rates may change during the day and from day to day, is there a way to use such sampling to figure out how many likes and unlikes there really are?

I don't understand the purpose of your question. Let's move at a coarser yet more significant level. There are lots of assumptions in your analysis that sound unrealistic. For instance, that Facebook offers a real time photo of the "likes" and they don't process information by batches; that they have a unique database and a unique server so you get your number of likes from the same exact source and process every time you fetch it; that people feel like liking the page at a constant rate and not because of what they say, publish, promote, advertise, etcetera; also, that people feel like disliking the page à la Poisson and not, again, because of what they say, because they are mentioned somewhere, or whatever the heck. How many "dislikes" come from accounts that are terminated? I for one hope I will be able to send my FB page into oblivion soon, as I was forced to get an account in that humongous piece of manure to gain access to a forum I no longer like or need. On the contrary, do terminated accounts keep the likes? Do you know for sure?

You are asking for finer statistical analysis without showing how the system works.
 
Assuming that likes and unlikes arrive randomly (should be normally distributed), but realizing that rates may change during the day and from day to day, is there a way to use such sampling to figure out how many likes and unlikes there really are?

Not with the data you have. A simple method would be to assume a constant ratio of likes to dislikes. If the interval sample period showed 2 likes for every unlike, we would assume a ratio of 2:1. With an average daily increase of 60 likes, we would then calculate an estimated average of 120 likes and 60 unlikes per day.

However, the sample period shows a decrease in total likes, which is inconsistent with the daily increase. That tells us that the ratio must be higher during the other periods of the day, but we don’t know how much higher. It could be that during the other 21 hours there are 60 likes and 0 unlikes, or something like 5000 likes and 4970 unlikes.

We need to know either the ratio of likes to unlikes, or the total number of votes (either like or unlike), but the sample provides no indication of either. All that we know is that the ratio is different during the other times of the day, and likely the voting activity is also different. But we can’t know how different.
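
Just to spell out the arithmetic behind the constant-ratio assumption above (a throwaway sketch; the function name and parameters are mine):

```python
# With likes = r * unlikes and a net daily gain N: N = likes - unlikes = (r - 1) * unlikes,
# so unlikes = N / (r - 1) and likes = r * N / (r - 1).
def split_net_gain(net_gain_per_day, like_to_unlike_ratio):
    unlikes = net_gain_per_day / (like_to_unlike_ratio - 1.0)
    return like_to_unlike_ratio * unlikes, unlikes

print(split_net_gain(60, 2.0))   # (120.0, 60.0) -- the example above
```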
 
I don't understand the purpose of your question. Let's move at a coarser yet more significant level. There are lots of assumptions in your analysis that sound unrealistic.
Maybe.

For instance, that Facebook offers a real time photo of the "likes"
There appear to be frequent updates, mostly by increments of +1 or -1. It doesn't matter so much if there is a delay between some user clicking "like" and the page showing an increment of +1, and it is also not important whether that delay is constant, or even whether the in-/decrements are shown in the correct order. The only important thing is that I get to see most of the individual increments.

and they don't process information by batches;
The fact that most changes are +/-1 seems to indicate that they don't do batches, OR that the batches are processed at a frequency higher than the rate at which changes come in.

that they have a unique database and a unique server so you get your number of likes from the same exact source and process every time you fetch it;
I think this is the most important possible problem that you came up with. Hmmm, don't know how to control for this :o

that people feel like liking the page at a constant rate and not because of what they say, publish, promote, advertise, etcetera; also, that people feel like disliking the page à la Poisson and not, again, because of what they say, because they are mentioned somewhere, or whatever the heck.
They don't publish a lot, only about once every three days. I have not yet thought about doing stats before and after a publication; first things first: if I don't have an algorithm that gives me better-than-ad-hoc estimates for the rate when the rate is constant, then I don't need to bother with changing rates.

How many "dislikes" come from accounts that are terminated? I for one hope I will be able to sent my FB page into oblivion soon, as I was forced to get an account in that humongous piece of manure to gain access to a forum I no longer like or need. On the contrary, do terminated accounts keep the likes? Do you know for sure?
I think that yes, terminating an account will remove a "Like" from the counter, but I don't know for sure.

That is not terribly interesting to me, actually.

You are asking for finer statistical analysis without showing how the system works.
Objections noted with interest. Thank you!
 
Not with the data you have. A simple method would be to assume a constant ratio of likes to dislikes. If the interval sample period showed 2 likes for every unlike, we would assume a ratio of 2:1. With an average daily increase of 60 likes, we would then calculate an estimated average of 120 likes and 60 unlikes per day.
Since I want to find out the rate of unlikes, I should not assume too much about it. I realize by now that I can't do it without both a high sampling frequency, f(sample) >= 2 * (f(likes) + f(unlikes)), and a large number of samples.

However, the sample period shows a decrease in total likes, which is inconsistent with the daily increase.
This is by mere chance.

That tells us that the ratio must be higher during the other periods of the day, but we don’t know how much higher.
The 116 five-minute samples came from different times of the day. The result is coincidence; 116 samples is simply not enough, given that both rates are changing up and down.
The samples in my OP were taken on a day when I noticed for the first time that the number of likes was decreasing, which is what prompted me to open this thread. So that outlier is not a random occurrence; I picked that stretch precisely because it looked unusual.

It could be that during the other 21 hours there are 60 likes and 0 unlikes, or something like 5000 likes and 4970 unlikes.
I have sampled at different times of the day and found that rates vary somewhat, but not nearly as drastically as your example. Almost always, the majority of samples have +/-0.

We need to know either the ratio of likes to unlikes, or the total number of votes (either like or unlike), but the sample provides no indication of either. All that we know is that the ratio is different during the other times of the day, and likely the voting activity is also different. But we can’t know how different.
I would be fine with getting averages for a full day.



But as I said, I am accepting that nothing really can substitute for high frequency* and a large number of samples.

* At which point aleCcowaN's objection that FB may not give me reliable results at high frequencies applies, and possibly frustrates my entire endeavour.
Perhaps a wholly different approach is necessary, such as crawling FB and listing all accounts that "Like" the page, 24 hours apart, and counting those that are there at the beginning of the day but not at the end, and vice versa. AND I am not going to do this :D
 
Since I want to find out the rate of unlikes, I should not assume too much about it. I realize by now that I can't do it without both a high sampling frequency, f(sample) >= 2 * (f(likes) + f(unlikes)), and a large number of samples.


From the data in your OP it seems to me that likes and dislikes are sparse enough that if you were to assume that each rating of n likes or dislikes represented exactly n likes or dislikes, your resulting estimated rates would be reasonably accurate. You could always adjust them subjectively by making a reasonable guess at the probability that a 5-minute change in likes of ±1 was due to offsetting ratings of ±2 and ∓1.

Another approach might be to increase the sampling frequency within certain 5-minute intervals when you expect activity to be high. This would provide you with an empirical frequency distribution of the actual number of likes and dislikes represented by a 5-minute-interval net rating of n, for all values of n that you observed. You could then treat each future 5-minute net rating of n as if it represented the observed frequency distribution for n.

Finally, in this age of Big Data, isn't there some way to get the actual stream of likes and dislikes without manual intervention? Or, alternatively, running a script, say daily, to compare the accounts who have liked your page with those from the previous day. That would provide the actual number of likes and unlikes over the previous 24 hours, excluding anyone who liked and unliked your page in the same day (which you might want to exclude anyway).
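
A minimal sketch of what that daily comparison could look like, assuming the list of liking accounts can somehow be exported to a plain text file each day, one account ID per line (the file names and the export step itself are hypothetical, not a real Facebook API call):

```python
# Compare today's list of liking accounts with yesterday's to count gross gains/losses.
def load_ids(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

yesterday = load_ids("likers_2015-06-01.txt")   # hypothetical daily exports
today = load_ids("likers_2015-06-02.txt")

new_likes = today - yesterday       # accounts that liked since yesterday
lost_likes = yesterday - today      # accounts that unliked (or were deleted)

print("likes gained: %d, likes lost: %d" % (len(new_likes), len(lost_likes)))
```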
 
From the data in your OP it seems to me that likes and dislikes are sparse enough that if you were to assume that each rating of n likes or dislikes represented exactly n likes or dislikes, your resulting estimated rates would be reasonably accurate. You could always adjust them subjectively by making a reasonable guess at the probability that a 5-minute change in likes of ±1 was due to offsetting ratings of ±2 and ∓1.

Another approach might be to increase the sampling frequency within certain 5-minute intervals when you expect activity to be high. This would provide you with an empirical frequency distribution of the actual number of likes and dislikes represented by a 5-minute-interval net rating of n, for all values of n that you observed. You could then treat each future 5-minute net rating of n as if it represented the observed frequency distribution for n.
These two approaches are what I will actually end up doing, and, frankly, it's really sufficient for my purposes.

This manual sampling every few minutes is tedious and disruptive, so I came here thinking that perhaps longer intervals could be cracked using some smart algorithms that exploit a few (assumed) properties of the situation, such as that this is a case of a random walk whose sums should be binomially distributed. By looking at the distribution of net gains/losses, one could perhaps infer the average number of random-walk steps per interval; but thinking about this, it appears that the rates change too fast and too often to ever yield anything useful, so I am back at high-frequency sampling, which is conceptually easy enough to interpret.

Finally, in this age of Big Data, isn't there some way to get the actual stream of likes and dislikes without manual intervention? Or, alternatively, running a script, say daily, to compare the accounts who have liked your page with those from the previous day. That would provide the actual number of likes and unlikes over the previous 24 hours, excluding anyone who liked and unliked your page in the same day (which you might want to exclude anyway).
It's not my page, so I don't have direct access to who liked it.
Plus, that would be quite a cannon aimed at a sparrow.



Folks, it's been a learning experience, thanks for y'all's input!
 
