Chromatic Orb probabilities Spreadsheet (Update: N = 1570)

Hi all,
inspired by this post on reddit:

I decided to do some thinking about the function which determines socket colors after you use a chromatic orb. It's well understood that color chances are a function of the stat requirements of the item: the higher a stat requirement, the more biased the roll is in favor of the color associated with that stat. An item with no stat requirements should roll all colors equally. The exact function has been a mystery in spite of one attempt (during open beta) to get a community log going:


So I've started my own community log. The ultimate goal of the project is not just to test my theory of how chromatics work, but to create a "chromatic calculator" where you can input the information about your item and the set of colors you want, and get an exact answer of how likely you are to see that roll after some number of chromatic orbs. Right now, I'm working on setting up a website to do this. There will be two versions. The first will be an Excel spreadsheet. This is already completed, although it's still in testing and I'll be mucking around with it further before official release. You can download it below. The second will be a Python script which you can execute on the website directly, so you can get your answers without having to download anything at all!

Now onto the details!

How I think chromatics work:
Spoiler

1) Every socket is rolled independently of every other socket

2) If the colors on the freshly rolled item match exactly the colors on the item previously, the entire rolling process is repeated, such that rolling a chromatic will never result in the exact same configuration of socket colors (although the total number of each color may remain the same)

3) Color probabilities are rolled similarly to Jeweler's and Fusions:

Every possible option (R, G, or B in this case) is given an integer value. Let's call R the integer for red, B for blue and G for green. A roll then occurs from 1 to the sum of these integers, and where that roll lands determines the color of the socket. Now, of course, this exact proposal wouldn't work if the integers were just the stat requirements! A 200INT item would have zero chance of rolling non-blue sockets. My original conjecture was that a constant X is added to off-colors in order to give them non-zero probability. My fellow Redditors have convinced me that it's more likely that every color has its stat requirement augmented by X.

So, for this example we have a 200INT item, with let's say one socket to roll. We use a chromatic. The chance of blue is (200 + X) / (200 + 3X), the chance of green is (X) / (200 + 3X), and the chance of red is the same at (X) / (200 + 3X).
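The formula above can be sketched in a few lines of Python. This is only an illustration of the conjecture, not GGG's actual code: the function name `color_probs` and the argument order (Str, Dex, Int mapping to red, green, blue) are my own choices, and X is passed in as a plain number.

```python
from fractions import Fraction

def color_probs(str_req, dex_req, int_req, x):
    """Per-socket (red, green, blue) chances under the conjecture that
    every color's weight is its stat requirement plus the constant X."""
    weights = (str_req + x, dex_req + x, int_req + x)
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero, so no color can be rolled")
    return tuple(Fraction(w, total) for w in weights)

# The 200INT example from the text, with X = 20 as a trial value.
red, green, blue = color_probs(0, 0, 200, 20)
print(f"red {float(red):.4f}  green {float(green):.4f}  blue {float(blue):.4f}")
```

Exact fractions are used so the three chances always sum to exactly 1. Changing the last argument shows how sensitive the off-color chances are to X.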

The nice aspect of this theory is that there is only a single constant that we need to infer. We need data to confirm the exact number, and it may be the case that this is not how PoE actually instantiates chromes. Still, data would be nice. Therefore I'm starting a community log. I'm a very poor self-found character, so I'm just going to note my typical usage.


Here are the rules for the log:
Spoiler

1) Note each roll in detail. We need all the base item stat requirements, the number of previous colors on the item (sum of each color), and the sum of each color for the rolled item afterward.

2) The purpose of collecting the previous colors on the item is to avoid bias in reporting. You need to start recording information before the roll happens; this should avoid cases where people report only "rare" occurrences.

3) An additional purpose of collecting the previous colors is so that we can run a simulation and take into account the fact that you'll never roll the exact same colors twice. This matters less for items with many sockets, but heavily biases items with fewer sockets and with lower base stat requirements.

4) Obviously don't mess with other people's data.

5) I had originally hoped to avoid items with -stat requirement mods, but I seem to recall reading a mod post that made it clear that chromatics would roll off the modified stat values. So if you want to add data from such an item be sure to post the listed stat requirements rather than the requirements for the base item type.

If you're interested, let me know here to add visibility. Otherwise, just add your data and I'll report back once we've got enough that we can run the stats. Just fyi, I'm planning on calculating X using a simple Metropolis-Hastings algorithm. I'm assuming X exists between 0 and 100 with uniform prior.


Machine learning details (for those interested):
Spoiler

So this is an interesting little problem. Because we can calculate the exact probability of getting any set of colors from the previous set (given a specific value of X), this is a problem which can lend itself well to many parameter estimation techniques. Because my background is in Bayesian methods (and I love when my programs run super slowly), I decided to go with Metropolis-Hastings. For those not familiar, MH is a Markov Chain Monte Carlo method of estimation. Feel free to google, but basically over time it's going to come up with samples of what it thinks the value of X is. Some samples will be higher, some lower. And when you look at the set of values that were chosen, those values will (in the limit) represent the true probability distribution of X, given the data.

So what we get out of the algorithm is not just a single value of X, but we get a distribution of values. So, maybe 20% of the samples had X=11, 60% were X=12, and 20% were X=13. That tells us that 12 is the single best estimate, but there's some uncertainty and we can use that later to provide better estimates for what we care about, the probability of rolling the colors we want.
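Here is a small sketch of how a list of samples turns into that kind of distribution, and how the distribution can then be used to average a prediction over X. The sample list below just reproduces the made-up 20/60/20 split from the paragraph above (it is not real output), and `on_color_chance` assumes a single-stat item under my model.

```python
from collections import Counter

def posterior_pmf(samples):
    """Turn a list of sampled X values into {X: share of samples}."""
    counts = Counter(samples)
    return {x: c / len(samples) for x, c in sorted(counts.items())}

def on_color_chance(req, x):
    """On-color chance per socket for a single-stat item (assumes req + 3x > 0)."""
    return (req + x) / (req + 3 * x)

def averaged_on_color_chance(req, pmf):
    """Average the on-color chance over the uncertainty in X."""
    return sum(share * on_color_chance(req, x) for x, share in pmf.items())

# Illustrative samples only: 20% at 11, 60% at 12, 20% at 13.
samples = [11] * 20 + [12] * 60 + [13] * 20
pmf = posterior_pmf(samples)
print(pmf)
print(max(pmf, key=pmf.get))            # single best estimate of X
print(averaged_on_color_chance(200, pmf))
```

Averaging over the whole distribution rather than plugging in the single best X is what lets the uncertainty carry through to the final odds.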

So the first step is to come up with an estimate of X, and like I said I'm using MH to do that. Specifically, I'm using a single chain of length 10,000 (with a 500-iteration burn-in). We're doing the learning on all of the data (pre- and post-1.1). There's no train/test split, but I don't think overfitting will be an issue given that there's only a single parameter. The initial value of X is randomized, but all values considered by the model are constrained to be at most 100 and at least 0. Rounding occurs at every step so that we only end up considering integer values of X. Previously, I had allowed the model to consider non-integers as well, but in the end I think it makes sense to constrain it, and it makes life easier later.
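For anyone who wants to see the shape of the estimation, here is a minimal Metropolis-Hastings sketch under the settings described above (integer X in 0 to 100, uniform prior, chain of 10,000 with a 500-iteration burn-in). It is not my actual analysis code. The function names, the row layout `(reqs, prev_counts, new_counts)`, the proposal width `step` and the example rows are all assumptions made for illustration.

```python
import math
import random

def log_likelihood(x, rolls):
    """Log-probability of every logged roll for a candidate X.
    Each roll is (reqs, prev_counts, new_counts), ordered (Str, Dex, Int)
    for reqs and (red, green, blue) for the counts."""
    total = 0.0
    for reqs, prev, new in rolls:
        weights = [r + x for r in reqs]
        if sum(weights) == 0:
            return -math.inf
        probs = [w / sum(weights) for w in weights]
        # Chance of one specific arrangement depends only on its color counts.
        p_same = math.prod(p ** k for p, k in zip(probs, prev))
        ways = math.factorial(sum(new)) // math.prod(math.factorial(k) for k in new)
        p_new = ways * math.prod(p ** k for p, k in zip(probs, new))
        if tuple(new) == tuple(prev):
            p_new -= p_same  # the identical arrangement is never kept
        if p_same >= 1.0 or p_new <= 0.0:
            return -math.inf
        total += math.log(p_new / (1.0 - p_same))
    return total

def metropolis_hastings(rolls, n_iter=10_000, burn_in=500, step=3, seed=0):
    """Sample integer X from its posterior under a uniform prior on 0..100."""
    rng = random.Random(seed)
    x = rng.randint(0, 100)
    ll = log_likelihood(x, rolls)
    samples = []
    for i in range(n_iter):
        proposal = x + rng.randint(-step, step)  # symmetric integer proposal
        if 0 <= proposal <= 100:                 # prior is zero outside the range
            ll_prop = log_likelihood(proposal, rolls)
            if ll_prop > -math.inf:
                diff = ll_prop - ll
                if diff >= 0 or rng.random() < math.exp(diff):
                    x, ll = proposal, ll_prop
        if i >= burn_in:
            samples.append(x)
    return samples

# Made-up rows purely to show the input shape; these are NOT community data.
example_rolls = [
    ((0, 0, 200), (0, 0, 3), (0, 1, 2)),
    ((0, 0, 200), (0, 1, 2), (0, 0, 3)),
    ((100, 0, 100), (1, 0, 1), (2, 0, 0)),
]
samples = metropolis_hastings(example_rolls)
print(len(samples), min(samples), max(samples))
```

With only three rows the samples will be spread very widely, so the point of the example is the mechanics rather than the numbers.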

After we do our estimation, we want to know what we expect the data to look like. To find out, we can create some fake data. Currently I'm using just the single best point estimate of X for this, but I intend to change it to use the full distribution instead. The dataset is set up so that each line corresponds to a single chromatic roll. So we can go through each line, look at the configuration of the item before the roll was made, and calculate the probabilities of all possible color configurations. We do this for each line in the data and then make calculations based on these probabilities. So, for instance, we can know exactly how many red sockets we expect if we use a chromatic.

Previously, this was calculated using Monte Carlo simulation, but the simulation is not necessary and the exact calculations are much faster.
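As a rough illustration of what the exact calculation looks like for a single line of data, here is a sketch. It assumes my model (independent sockets, weights of stat requirement plus X, and a re-roll whenever the exact previous arrangement comes up), works with color counts because that is what the log records, and needs X > 0. The function names are made up for the example.

```python
from math import factorial

def color_probs(reqs, x):
    """Per-socket (red, green, blue) chances for (Str, Dex, Int) requirements."""
    weights = [r + x for r in reqs]
    total = sum(weights)
    return [w / total for w in weights]

def outcome_distribution(reqs, prev_counts, x):
    """Probability of each (red, green, blue) count after one chromatic,
    given the color counts on the item before the roll."""
    pr, pg, pb = color_probs(reqs, x)
    n = sum(prev_counts)
    # Chance of hitting the exact previous arrangement, which gets re-rolled.
    p_same = pr ** prev_counts[0] * pg ** prev_counts[1] * pb ** prev_counts[2]
    dist = {}
    for nr in range(n + 1):
        for ng in range(n - nr + 1):
            nb = n - nr - ng
            ways = factorial(n) // (factorial(nr) * factorial(ng) * factorial(nb))
            p = ways * pr ** nr * pg ** ng * pb ** nb
            if (nr, ng, nb) == tuple(prev_counts):
                p -= p_same  # same counts are fine, the same arrangement is not
            dist[(nr, ng, nb)] = p / (1 - p_same)
    return dist

def expected_reds(reqs, prev_counts, x):
    """Expected number of red sockets after one chromatic."""
    dist = outcome_distribution(reqs, prev_counts, x)
    return sum(counts[0] * p for counts, p in dist.items())

# Example: a 3-socket 200INT item currently showing 1 green and 2 blue, X = 13.
print(expected_reds((0, 0, 200), (0, 1, 2), 13))
```

Summing `expected_reds` over every line in the log gives the expected total of red sockets that gets compared against the observed total.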


And here's the spreadsheet:


And all code involved in the data analysis if you're interested in the details (this is still an old version, I'll be updating it soon):


And the current Chromatic Calculator spreadsheet. To make the calculations, just enter the current set of colors for your item, its stat requirements, and the set of colors you want. Hit calculate (you'll need to enable macros) and you'll get the numbers. On the bottom left is the most accurate calculation so far. It averages over each value of X based (right now) on the most up-to-date estimates we have. To the right you can also set a specific value of X and you'll see the probabilities after N chromatics if that value of X is the "true" value. Since this is a macro-enabled Excel sheet, you won't be able to view it in Google Docs. I'm just using it for hosting right now, so feel free to download and mess around with it!
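For those who would rather read code than a spreadsheet, here is a sketch of the "probability after N chromatics for a fixed X" calculation. It is not the spreadsheet's macro, just one way to do it under my model: the item's color counts are treated as a Markov chain, the target is a set of color counts, and the chain stops once the target is hit. All names are made up for the example and X is assumed to be greater than 0.

```python
from math import factorial

def color_probs(reqs, x):
    """Per-socket (red, green, blue) chances for (Str, Dex, Int) requirements."""
    weights = [r + x for r in reqs]
    total = sum(weights)
    return [w / total for w in weights]

def count_states(n):
    """All (red, green, blue) count triples for an item with n sockets."""
    return [(r, g, n - r - g) for r in range(n + 1) for g in range(n - r + 1)]

def transition_row(probs, state):
    """Distribution over the next color counts, given the current counts."""
    n = sum(state)

    def p_exact(counts):  # chance of one specific arrangement with these counts
        return probs[0] ** counts[0] * probs[1] ** counts[1] * probs[2] ** counts[2]

    p_same = p_exact(state)
    row = {}
    for nxt in count_states(n):
        ways = factorial(n) // (factorial(nxt[0]) * factorial(nxt[1]) * factorial(nxt[2]))
        p = ways * p_exact(nxt)
        if nxt == state:
            p -= p_same  # the identical arrangement is re-rolled away
        row[nxt] = p / (1 - p_same)
    return row

def chance_within(reqs, x, start, target, n_orbs):
    """Chance of seeing the target color counts at least once in n_orbs rolls."""
    start, target = tuple(start), tuple(target)
    if start == target:
        return 1.0
    probs = color_probs(reqs, x)
    rows = {s: transition_row(probs, s) for s in count_states(sum(start))}
    pending = {start: 1.0}  # probability mass that has not reached the target yet
    for _ in range(n_orbs):
        nxt_pending = {}
        for state, mass in pending.items():
            for nxt, p in rows[state].items():
                if nxt != target:
                    nxt_pending[nxt] = nxt_pending.get(nxt, 0.0) + mass * p
        pending = nxt_pending
    return 1.0 - sum(pending.values())

# Example: 4-socket 200INT item, all blue now, wanting 2 red and 2 blue, X = 13.
for n_orbs in (1, 10, 50):
    print(n_orbs, chance_within((0, 0, 200), 13, (0, 0, 4), (2, 0, 2), n_orbs))
```

Averaging over the uncertainty in X would then just be a weighted sum of `chance_within` over each candidate X, with weights taken from the estimated distribution.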


I'm updating the title with the number of data points we currently have. I'll update this original post if there's crazy news. Otherwise, I'll be updating a post below with the results of the analysis so far.
Last edited by lawphill on Apr 1, 2014, 1:54:38 PM
Update! We currently have 1570 datapoints! Getting better, but still need those submissions!

So, in the past the best estimate for our value of X has been somewhere between 14 and 16. That number seems to be trending downwards as we add more data, which most likely means that the early data was biased towards higher numbers of off-colors. The current best guess for X is 13. I've changed the estimation so that only integer values of X are now considered. There was also an incorrect formula for calculating probabilities, which has been fixed. The 95% credible interval is [12, 14]. I should mention that this range indicates our certainty in the estimate given the data, not our certainty in the data itself. This is why we see shifts in the best estimate as the dataset gets larger and better.

One way to visualize the data is to calculate how often an item came up with X number of a given colored socket. You can see the actual distribution here along with a simulated distribution which you'd predict if my model were true: http://imgur.com/a/k3y8L

As always, take the following updates with a grain of salt. Data collection continues. Even 1570 rolls is not a large amount of data for this kind of estimate. If we assume the model is correct, I can simulate 1570 data points at a time and see what the estimate comes out as. I did this informally back in February, setting X=15 and letting the algorithm do its learning to see what came out. I saw estimates ranging from 13 to 17, all depending on the specific rolls that happened to be there. So basically, we're going to have to keep collecting data.

It's also worth mentioning that this kind of model fitting is very easy to do, but it rests on the assumption that the model is accurate. There are basically two ways to check that assumption. The first is model comparison: take my model, compare it against another, and see which explains the data better. The second is to use statistical testing to compare my model's theoretical predictions against what the actual data looks like.

Model comparison:
Spoiler

Bayesian model comparison is a very useful way to compare two different models. The biggest problem with this is a lack of fleshed-out competing models. One thing to consider is that item mods have weights which differ depending on the item type (e.g. axe,sword,chest,ring,etc...) It's possible this is also true for chromes, if this is the case, we'd need much, much more data to prove it since instead of estimating 1 parameter, we'd need to estimate many dozens potentially. Other possible models might instantiate the check to make sure you don't reroll the same exact colors in a different way, perhaps one which biases the item to roll certain color combinations more frequently. With the time that I have, I'm not focusing on model comparison at the moment, but if someone has a proposal I'll definitely test it. Otherwise this won't really happen until I have a better idea in my mind of what a different model might look like.


Statistical testing:
Spoiler

First off, these numbers are from Feb 14th. I'm not going to continue updating this section unless the statistics come out drastically differently. Otherwise, you can just assume everything here still holds in spirit.

So an obvious question is whether or not my model is the true model. This is obviously a difficult question to answer. The best we can do is to have the model create predictions about what the data should look like, and then compare the actual data to these predictions. Since the data takes the form of frequency counts (i.e. how often did a given color combination appear?), we can compare the observed frequencies against the expected frequencies using a chi-square test. One limit of the chi-square test is that it works best when all expected frequencies are at least 5. This isn't a problem if we want to test how many red sockets we expect in total, but it is a problem if we are trying to test how often we expect 6 red sockets to appear in the data set. Because of this, I can't run a chi-square test on the distribution presented above. Instead, here are some other comparisons I've been looking at:

Total number of red, green, blue sockets
Observed = [352, 928, 1468]
Expected = [358.4, 934.1, 1455.5] (for X = 15)

The differences here are not statistically significant (p = .8770, chi2 = 0.2624, df = 2)

Number of items with pure colors, only 2 colors, or all 3 colors
Observed = [114, 358, 129] (1 color only, 2 colors, 3 colors)
Expected = [111.6, 370.8, 118.6] (X = 15)

No statistical significance (p = .4959, chi2 = 1.4028, df=2). Not enough evidence to conclude there are more tricolor items in the game than the model would account for.
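The two tests above can be re-derived with a few lines of Python. This is a sketch rather than my original script. It uses the rounded expected counts exactly as printed, so the results should land close to, but not exactly on, the figures quoted. The p-value shortcut `exp(-stat / 2)` is only valid because both tests have exactly 2 degrees of freedom.

```python
import math

def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df2(stat):
    """Upper-tail p-value for a chi-square statistic with exactly 2 df."""
    return math.exp(-stat / 2)

tests = [
    ("total red/green/blue sockets", [352, 928, 1468], [358.4, 934.1, 1455.5]),
    ("items with 1/2/3 colors", [114, 358, 129], [111.6, 370.8, 118.6]),
]
for label, observed, expected in tests:
    stat = chi_square_stat(observed, expected)
    print(f"{label}: chi2 = {stat:.4f}, p = {p_value_df2(stat):.4f}")
```

For tests with other degrees of freedom (such as the df = 8 case below) a general chi-square survival function from a statistics library would be needed instead of the shortcut.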

Number of items rolling 0,1,2,... of each color socket
This is the distribution I presented above. Because items with 5 or 6 sockets of a single color are rare, we can't run a chi-square on the full dataset. We could run it instead with 5 possible outcomes for each color. So for red sockets we might have [# w/ 0 red, # w/ 1, # w/ 2, # w/3, # w/4-6]

This still has one outcome whose expected value is below 5, but the chi2 should still do ok since it's only 1 cell. This came out not significant (p = .7619, chi2 = 4.9598, df = 8). I had originally been calculating the statistic incorrectly (whoops!), but that's fixed now for all the measures.


So basically the evidence right now indicates that the model is a good fit to the data. I'm of course very pleased with this. But it's worth mentioning that there's a lot of randomness inherent in this kind of process. The exact estimate of X is going to take a while to actually discover, and perhaps with a larger dataset we will have enough information to say that another slightly different model might fit the data better.

If you have other aspects of the data you'd like to see tested, please let me know! Currently, it would appear the model fits the data quite well. Well enough that I don't think anything is going to change or that it's particularly worth investing time in coming up with competing models. I'll be continuing data collection so we can get a better estimate of the exact value of X (in particular whether 14 or 15 is the better value) so that someone can take that and create a chromatic calculator to estimate the odds for any given socket combo.

Spoilering this original data analysis post, which concerned a small set of data I collected on the colors of dropped items.
Spoiler
Quick follow up. I've coded up the ML algorithms needed to run on the data, just don't have enough data yet. It may take quite some time, since the amount of work required to note all the details is relatively high.

In the meantime, I've started a personal log of socket colors from dropped items (from enemies and chests). The amount of data is small (n=287, representing 615 sockets) and the calculations involved are slightly different than for chromes. Namely, you don't need to worry about rolling the exact same sockets twice in a row.

So, a word of warning: there's no guarantee that the drop data uses the same function as the chromatics. That said, I think it's likely. On the other hand, I'm wary of using socket colors from vendors since GGG might have good incentive to avoid certain rolls (namely chrome vendor recipe items) in order to control the economy. Of course, it's possible GGG artificially controls how often 3-linked 3-color items drop. As I said, no guarantees!

That being said, I've run the drop data and the result is that X is very likely 20. I'm using Bayesian techniques, so a credible interval is worth reporting. That is to say there's a 95% chance that X lies in the range [15.16,23.80].

20 was interestingly my best guess before I had any data. The reason being that the general rule of thumb for a single-stat item is that chromatics have 80% chance for on-color and 10% chance for each off-color. If X is indeed 20, then when we roll a 200INT item, we should have (ignoring the "can't roll the exact same colors" issue) an 84.62% chance of rolling blue, and 7.69% chance of rolling red or green.

Can't wait for more data!
Last edited by lawphill on Apr 1, 2014, 1:17:33 PM
I suspect it's a bit more complex. For example, this wand:


It has no requirements. Anyway, after about 60 Chroms, fully half were BGR. It seemed to try REALLY hard to have one of each.

Same goes for Hybrid Armor/Evasion/ES gear, where it seriously wants to have BOTH. Not the same chance for one or the other, but BOTH at the same time.

Need exact numbers for the chest thing, but I can tell you that while rolling a 6L Voll's, it was almost always a mix of Blue/Red, to the effect that it feels like it's not rolling per Socket, but PER ITEM. I tossed about 600 Chroms on that one, so I got a fair share of color options.
I Cast Magic Missile! - poeurl.com/DHE
Last edited by CantripN on Jan 18, 2014, 12:55:12 PM
Added some data, but as suggested in the previous post it might well be more complicated than it appears.
A shame I didn't record my use of 2100+chrom when I coloured my armour, that would have been a lot of data at once!
I'm definitely interested in analyzing the data however I can to figure out what's going on. In particular, if someone else has a theory that you think might account for the data better, please let me know! I'm open to suggestions. GGG seems to really like using integers for their probability calculations, so I've been trying to think of other ways they could implement the function which would avoid the need for floating points.

RE: chromes rolling per item rather than per socket, this is definitely a possibility. Another possibility is that chromes roll per group of linked sockets. GGG could use either of these setups to increase the probability of things like tri-colored, linked sockets. I'll definitely try to see if these color combinations are occurring more often than chance would predict. Because I'm not asking for detailed socket information (e.g links) I won't be able to distinguish between these two explanations, but with enough data we should be able to determine whether or not the data suggests independence between sockets or not.

Thanks to everyone who's submitted data! I haven't had a use to spend chromes lately, but I'll be adding some of my own data sporadically.
If anyone is interested, here is the formula for the probability P of a given colour combination, where Pr is the probability of a single socket rolling red, and Pb and Pg are the same for blue and green (Pr+Pb+Pg=1).

For a cluster of N linked sockets with Nr desired red sockets, Nb blue and Ng green (with 1<=N<=6 and Nr+Ng+Nb=N), it is

P = N! * (Pr^Nr) * (Pb^Nb) * (Pg^Ng) / (Nr!*Nb!*Ng!)

( i hope my calculations are exact... :) )
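The formula above is the standard multinomial probability, and it can be checked with a short Python sketch. The function name and argument order are my own, and note that it ignores the "never the exact same colours twice" rule discussed earlier in the thread.

```python
from math import factorial

def combo_probability(n_red, n_green, n_blue, p_red, p_green, p_blue):
    """P = N! * Pr^Nr * Pg^Ng * Pb^Nb / (Nr! * Ng! * Nb!)"""
    n = n_red + n_green + n_blue
    arrangements = factorial(n) // (
        factorial(n_red) * factorial(n_green) * factorial(n_blue)
    )
    return arrangements * p_red ** n_red * p_green ** n_green * p_blue ** n_blue

# One socket of each colour on a 3-socket item with no stat requirements.
print(combo_probability(1, 1, 1, 1 / 3, 1 / 3, 1 / 3))
```

The factorial term counts how many different socket orderings give the same colour totals, which is why mixed results come up more often than pure ones.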
Roma timezone (Italy)
For ease of reading, I've moved the updated reports to the top post, I'll keep archived reports down here.

Archived reports:
Spoiler

Feb 25th
Spoiler

Update! We currently have 1242 datapoints! Many thanks to Daarknight, who helped double the size of the corpus!

So, in the past the best estimate for our value of X has been somewhere between 14-16. That number seems to be trending downwards as we add more data, essentially what this means is that the early data was very likely biased towards higher numbers of off-colors. Current best guess at X is 12 (the data fits 12.05, with 95% certainty in the range [11.19, 13.09]). I should mention that this range indicates our certainty that the estimate is good, not that the data itself is good. This is why we can see a large shift in the best estimate, because our data has gotten larger/better.

One way to visualize the data is to calculate how often an item came up with X number of a given colored socket. You can see the actual distribution here along with a simulated distribution which you'd predict if my model were true: http://imgur.com/a/BsTSm

As always, take the following updates with a grain of salt. Data collection continues. Even 1200 rolls is not a large amount of data for this kind of estimate. If we assume the model is correct, I can simulate 1200 data points at a time and see what the estimate comes out as, doing this informally today, I set X=15 and then saw what estimates came out. I saw a lot ranging from 13 - 17, all based on the specific rolls that happened to be there. So basically, we're going to have to keep collecting data.

It's also worth mentioning that this kind of model fitting is very easy to do, but it rests on the assumption that the model is accurate. There are basically two ways to check that assumption. The first is model comparison: take my model, compare it against another, and see which explains the data better. The second is to use statistical testing to compare my model's theoretical predictions against what the actual data looks like.

Model comparison:
Spoiler

Bayesian model comparison is a very useful way to compare two different models. The biggest problem with this is a lack of fleshed-out competing models. One thing to consider is that item mods have weights which differ depending on the item type (e.g. axe,sword,chest,ring,etc...) It's possible this is also true for chromes, if this is the case, we'd need much, much more data to prove it since instead of estimating 1 parameter, we'd need to estimate many dozens potentially. Other possible models might instantiate the check to make sure you don't reroll the same exact colors in a different way, perhaps one which biases the item to roll certain color combinations more frequently. With the time that I have, I'm not focusing on model comparison at the moment, but if someone has a proposal I'll definitely test it. Otherwise this won't really happen until I have a better idea in my mind of what a different model might look like.


Statistical testing:
Spoiler

First off, these numbers are from Feb 14th. I'm not going to continue updating this section unless the statistics come out drastically differently. Otherwise, you can just assume everything here still holds in spirit.

So an obvious question is whether or not my model is the true model. This is a difficult question to answer obviously. The best we can do is to have the model create predictions about what the data should look like, and then compare the actual data to these predictions. Since the data takes the form of frequency counts (i.e. how often did X color combination appear?), we can compare the observed frequencies versus the expected frequencies using a chi-square test. One limit to the chi-square, is that it works best when all expected frequencies are at least 5. This isn't a problem if we want to test how many red sockets we expect, but it is a problem if we are trying to test how often we expect 6 red sockets to appear in the data set. Because of this, I can't run a chi-square test on the distribution presented above. Instead, here are some other comparisons I've been looking at:

Total number of red, green, blue sockets
Observed = [352, 928, 1468]
Expected = [358.4, 934.1, 1455.5] (for X = 15)

The differences here are not statistically significant (p = .8770, chi2 = 0.2624, df = 2)

Number of items with pure colors, only 2 colors, or all 3 colors
Observed = [114, 358, 129] (1 color only, 2 colors, 3 colors)
Expected = [111.6, 370.8, 118.6] (X = 15)

No statistical significance (p = .4959, chi2 = 1.4028, df=2). Not enough evidence to conclude there are more tricolor items in the game than the model would account for.

Number of items rolling 0,1,2,... of each color socket
This is the distribution I presented above. Because items with 5 or 6 sockets of a single color are rare, we can't run a chi-square on the full dataset. We could run it instead with 5 possible outcomes for each color. So for red sockets we might have [# w/ 0 red, # w/ 1, # w/ 2, # w/3, # w/4-6]

This still has one outcome whose expected value is below 5, but the chi2 should still do ok since it's only 1 cell. This came out not significant (p = .7619, chi2 = 4.9598, df = 8). I had originally been calculating the statistic incorrectly (whoops!), but that's fixed now for all the measures.


So basically the evidence right now indicates that the model is a good fit to the data. I'm of course very pleased with this. But it's worth mentioning that there's a lot of randomness inherent in this kind of process. The exact estimate of X is going to take a while to actually discover, and perhaps with a larger dataset we will have enough information to say that another slightly different model might fit the data better.

If you have other aspects of the data you'd like to see tested, please let me know! Currently, it would appear the model fits the data quite well. Well enough that I don't think anything is going to change or that it's particularly worth investing time in coming up with competing models. I'll be continuing data collection so we can get a better estimate of the exact value of X (in particular whether 14 or 15 is the better value) so that someone can take that and create a chromatic calculator to estimate the odds for any given socket combo.

Spoilering this original data analysis post, which concerned a small set of data I collected on the colors of dropped items.
Spoiler
Quick follow up. I've coded up the ML algorithms needed to run on the data, just don't have enough data yet. It may take quite some time, since the amount of work required to note all the details is relatively high.

In the meantime, I've started a personal log of socket colors from dropped items (from enemies and chests). The amount of data is small (n=287, representing 615 sockets) and the calculations involved are slightly different than for chromes. Namely, you don't need to worry about rolling the exact same sockets twice in a row.

So, a word of warning: there's no guarantee that the drop data uses the same function as the chromatics. That said, I think it's likely. On the other hand, I'm wary of using socket colors from vendors since GGG might have good incentive to avoid certain rolls (namely chrome vendor recipe items) in order to control the economy. Of course, it's possible GGG artificially controls how often 3-linked 3-color items drop. As I said, no guarantees!

That being said, I've run the drop data and the result is that X is very likely 20. I'm using Bayesian techniques, so a credible interval is worth reporting. That is to say there's a 95% chance that X lies in the range [15.16,23.80].

20 was interestingly my best guess before I had any data. The reason being that the general rule of thumb for a single-stat item is that chromatics have 80% chance for on-color and 10% chance for each off-color. If X is indeed 20, then when we roll a 200INT item, we should have (ignoring the "can't roll the exact same colors" issue) an 84.62% chance of rolling blue, and 7.69% chance of rolling red or green.

Can't wait for more data!


Feb 14th
Spoiler
Update! We currently have 601 datapoints! Data collection has slowed, but I've been adding some of my own data.

Here's what we've got so far. My best estimate for X is 15, although 14 is almost equally likely (14.59 best explains the data). Assuming the model is accurate, there's a 95% probability that X lies in the range [12.99, 16.53]. I think we've got good evidence that X is somewhere around 13-17. More data will pinpoint the exact value a little better, but 14 or 15 is not too bad a guess. The estimate has been drifting downwards over time, not sure why, we'll see what happens with continued data collection!

One way to visualize the data is to calculate how often an item came up with X number of a given colored socket. You can see the actual distribution here along with a simulated distribution which you'd predict if my model were true: http://imgur.com/a/mDHCj

As always, take the following updates with a grain of salt. Data collection continues. It's worth mentioning too that this kind of model fitting is very easy to do, but it rests on the assumption that the model is accurate. There are basically two ways to check that assumption. The first is model comparison: take my model, compare it against another, and see which explains the data better. The second is to use statistical testing to compare my model's theoretical predictions against what the actual data looks like.

Model comparison:
Bayesian model comparison is a very useful way to compare two different models. The biggest problem with this is a lack of fleshed-out competing models. One thing to consider is that item mods have weights which differ depending on the item type (e.g. axe,sword,chest,ring,etc...) It's possible this is also true for chromes, if this is the case, we'd need much, much more data to prove it since instead of estimating 1 parameter, we'd need to estimate many dozens potentially. Other possible models might instantiate the check to make sure you don't reroll the same exact colors in a different way, perhaps one which biases the item to roll certain color combinations more frequently. With the time that I have, I'm not focusing on model comparison at the moment, but if someone has a proposal I'll definitely test it. Otherwise this won't really happen until I have a better idea in my mind of what a different model might look like.

Statistical testing:
So an obvious question is whether or not my model is the true model. This is a difficult question to answer obviously. The best we can do is to have the model create predictions about what the data should look like, and then compare the actual data to these predictions. Since the data takes the form of frequency counts (i.e. how often did X color combination appear?), we can compare the observed frequencies versus the expected frequencies using a chi-square test. One limit to the chi-square, is that it works best when all expected frequencies are at least 5. This isn't a problem if we want to test how many red sockets we expect, but it is a problem if we are trying to test how often we expect 6 red sockets to appear in the data set. Because of this, I can't run a chi-square test on the distribution presented above. Instead, here are some other comparisons I've been looking at:

Total number of red, green, blue sockets
Observed = [352, 928, 1468]
Expected = [358.4, 934.1, 1455.5] (for X = 15)

The differences here are not statistically significant (p = .8770, chi2 = 0.2624, df = 2)

Number of items with pure colors, only 2 colors, or all 3 colors
Observed = [114, 358, 129] (1 color only, 2 colors, 3 colors)
Expected = [111.6, 370.8, 118.6] (X = 15)

No statistical significance (p = .4959, chi2 = 1.4028, df=2). Not enough evidence to conclude there are more tricolor items in the game than the model would account for.

Number of items rolling 0,1,2,... of each color socket
This is the distribution I presented above. Because items with 5 or 6 sockets of a single color are rare, we can't run a chi-square on the full dataset. We could run it instead with 5 possible outcomes for each color. So for red sockets we might have [# w/ 0 red, # w/ 1, # w/ 2, # w/3, # w/4-6]

This still has one outcome whose expected value is below 5, but the chi2 should still do ok since it's only 1 cell. This came out not significant (p = .7619, chi2 = 4.9598, df = 8). I had originally been calculating the statistic incorrectly (whoops!), but that's fixed now for all the measures.

If you have other aspects of the data you'd like to see tested, please let me know! Currently, it would appear the model fits the data quite well. Well enough that I don't think anything is going to change or that it's particularly worth investing time in coming up with competing models. I'll be continuing data collection so we can get a better estimate of the exact value of X (in particular whether 14 or 15 is the better value) so that someone can take that and create a chromatic calculator to estimate the odds for any given socket combo.

Feb 4th
Spoiler
Update! We currently have 513 datapoints. Data collection has slowed, but I've been adding some of my own data. Hopefully I'll be able to get some new visibility during the new leagues. Either way I very much plan on recording all the chromes I end up using!

Here's what we've got so far. My best estimate for X is 15, although 14 is almost equally likely (14.52 best explains the data). Assuming the model is accurate, there's a 95% probability that X lies between [12.83, 16.56]. I think we've got good evidence that X is somewhere around 14-17. More data will pinpoint the exact value a little better, but 14 or 15 is not too bad a guess. The estimate has been drifting downwards over time. I'm not sure why, so we'll see what more data indicates!
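
A minimal sketch of how a best estimate and a 95% interval for X could be computed. This is not necessarily the method behind the numbers above. It assumes a flat prior on X over a grid, treats every socket as an independent draw with probability (Stat + X) / (sum(stats) + 3X), and ignores the reroll check entirely. The function names, the grid bounds and the example record are all made up for illustration.

```python
import math

def color_probs(str_req, dex_req, int_req, x):
    """Per-socket chances of (red, green, blue) under the Stat + X model."""
    total = str_req + dex_req + int_req + 3 * x
    return ((str_req + x) / total, (dex_req + x) / total, (int_req + x) / total)

def log_likelihood(x, observations):
    """observations: list of ((STR, DEX, INT), (n_red, n_green, n_blue))."""
    ll = 0.0
    for stats, counts in observations:
        for p, n in zip(color_probs(*stats, x), counts):
            ll += n * math.log(p)
    return ll

def posterior_on_grid(observations, lo=1.0, hi=40.0, step=0.01):
    """Normalized posterior over a grid of X values, flat prior assumed."""
    n_points = int(round((hi - lo) / step)) + 1
    grid = [lo + k * step for k in range(n_points)]
    lls = [log_likelihood(x, observations) for x in grid]
    peak = max(lls)  # subtract the peak before exp() to avoid underflow
    weights = [math.exp(v - peak) for v in lls]
    norm = sum(weights)
    return grid, [w / norm for w in weights]

def summarize(grid, post, mass=0.95):
    """Return (most probable X, lower bound, upper bound) of a central interval."""
    best = grid[max(range(len(post)), key=post.__getitem__)]
    tail = (1.0 - mass) / 2.0
    cum, lower, upper = 0.0, None, None
    for x, p in zip(grid, post):
        cum += p
        if lower is None and cum >= tail:
            lower = x
        if upper is None and cum >= 1.0 - tail:
            upper = x
    return best, lower, upper

# Hypothetical data: a pure 100 INT item, 145 sockets rolled in total
data = [((0, 0, 100), (15, 15, 115))]
grid, post = posterior_on_grid(data)
print(summarize(grid, post))
```

The made-up counts were chosen to match X = 15 exactly, so the most probable value lands at 15. With real data the interval narrows as more rolls are added, which is the behavior seen across the updates in this thread.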

One way to visualize the data is to calculate how often an item came up with N sockets of a given color. You can see the actual distribution here along with a simulated distribution which you'd predict if my model were true: http://imgur.com/a/U1Tw5

As always, take the following updates with a grain of salt. Data collection continues. It's worth mentioning too that this kind of model fitting is very easy to do, but it rests on the assumption that the model is accurate. There are basically two ways to check that assumption. The first is model comparison: take my model, compare it against another, and see which explains the data better. The second is to use statistical testing to compare my model's theoretical predictions against what the actual data looks like.

Model comparison:
Bayesian model comparison is a very useful way to compare two different models. The biggest problem here is a lack of fleshed-out competing models. One thing to consider is that item mods have weights which differ depending on the item type (e.g. axe, sword, chest, ring, etc.). It's possible this is also true for chromes. If so, we'd need much, much more data to prove it, since instead of estimating 1 parameter we'd potentially need to estimate many dozens. Other possible models might implement the check that prevents rerolling the exact same colors in a different way, perhaps one which biases the item to roll certain color combinations more frequently. With the time that I have, I'm not focusing on model comparison at the moment, but if someone has a proposal I'll definitely test it. Otherwise this won't really happen until I have a better idea of what a different model might look like.

Statistical testing:
So an obvious question is whether or not my model is the true model. This is obviously a difficult question to answer. The best we can do is have the model generate predictions about what the data should look like, and then compare the actual data to those predictions. Since the data takes the form of frequency counts (i.e. how often did a given color combination appear?), we can compare the observed frequencies against the expected frequencies using a chi-square test. One limitation of the chi-square test is that it works best when all expected frequencies are at least 5. This isn't a problem if we want to test how many red sockets we expect in total, but it is a problem if we are trying to test how often we expect 6 red sockets to appear in the data set. Because of this, I can't run a chi-square test on the distribution presented above. Instead, here are some other comparisons I've been looking at:

Total number of red, green, blue sockets
Observed = [301, 877, 1188]
Expected = [300.6, 886.6, 1178.8] (for X = 15)

As you can see, we observed slightly more blues than expected, but fewer greens. This is not statistically significant (p = .4579, chi2 = 0.1761, df = 2)

Number of items with pure colors, only 2 colors, or all 3 colors
Observed = [95, 301, 117] (1 color only, 2 colors, 3 colors)
Expected = [87.6, 318.4, 107.0] (X = 15)

No statistical significance (p = .1424, chi2 = 2.512, df=2). No evidence that there are more tricolor items in the game than the model would account for.

Number of items rolling 0,1,2,... of each color socket
This is the distribution I presented above. Because items with 5 or 6 sockets of a single color are rare, we can't run a chi-square on the full dataset. We could run it instead with 5 possible outcomes for each color. So for red sockets we might have [# w/ 0 red, # w/ 1, # w/ 2, # w/3, # w/4-6]

This still has one outcome whose expected value is below 5, but the chi2 should still do ok since it's only 1 cell. This came out not significant (p = .0992, chi2 = 4.4485, df = 8). What effect there is is driven mostly by the number of items with 3 red sockets (more observed items than predicted). The effect is becoming less significant as we add more data, mostly because the expected value for that cell is rising, which distorts the statistic less. One possible explanation is that, since 3 red sockets was an unexpected outcome for most rolls, people were rolling chromes in the hopes of getting that many red sockets and stopped rolling once that outcome appeared. This could lead to biased reporting, although with enough data this bias would end up quite small. I'll be keeping an eye on this particular statistic as the amount of data increases, but it seems to be becoming less significant over time.

If you have other aspects of the data you'd like to see tested, please let me know!


Feb 2nd
Spoiler
Update! We currently have 431 datapoints. Data collection has slowed this last week, but I added some of my own data. Hopefully I'll be able to get some new visibility during the new leagues. Either way I very much plan on recording all the chromes I end up using!

Here's what we've got so far. My best estimate for X is 15 (15.17 best explains the data). Assuming the model is accurate, there's a 95% probability that X lies between [13.21, 17.70]. I think we've got good evidence that X is somewhere around 14-17. More data will pinpoint the exact value a little better, but 15 or 16 is not too bad a guess.

One way to visualize the data is to calculate how often an item came up with N sockets of a given color. You can see the actual distribution here along with a simulated distribution which you'd predict if my model were true: http://imgur.com/a/kY2a7

As always, take the following updates with a grain of salt. Data collection continues. It's worth mentioning too that this kind of model fitting is very easy to do, but it rests on the assumption that the model is accurate. There are basically two ways to check that assumption. The first is model comparison: take my model, compare it against another, and see which explains the data better. The second is to use statistical testing to compare my model's theoretical predictions against what the actual data looks like.

Model comparison:
Bayesian model comparison is a very useful way to compare two different models. The biggest problem here is a lack of fleshed-out competing models. One thing to consider is that item mods have weights which differ depending on the item type (e.g. axe, sword, chest, ring, etc.). It's possible this is also true for chromes. If so, we'd need much, much more data to prove it, since instead of estimating 1 parameter we'd potentially need to estimate many dozens. Other possible models might implement the check that prevents rerolling the exact same colors in a different way, perhaps one which biases the item to roll certain color combinations more frequently. With the time that I have, I'm not focusing on model comparison at the moment, but if someone has a proposal I'll definitely test it. Otherwise this won't really happen until I have a better idea of what a different model might look like.

Statistical testing:
So an obvious question is whether or not my model is the true model. This is obviously a difficult question to answer. The best we can do is have the model generate predictions about what the data should look like, and then compare the actual data to those predictions. Since the data takes the form of frequency counts (i.e. how often did a given color combination appear?), we can compare the observed frequencies against the expected frequencies using a chi-square test. One limitation of the chi-square test is that it works best when all expected frequencies are at least 5. This isn't a problem if we want to test how many red sockets we expect in total, but it is a problem if we are trying to test how often we expect 6 red sockets to appear in the data set. Because of this, I can't run a chi-square test on the distribution presented above. Instead, here are some other comparisons I've been looking at:

Total number of red, green, blue sockets
Observed = [244, 824, 930]
Expected = [233.0, 826.5, 938.5] (for X = 15)

As you can see, we observed more reds than expected, but fewer blues. This is not statistically significant (p = .3703, chi2 = 0.6006, df = 2)

Number of items with pure colors, only 2 colors, or all 3 colors
Observed = [72, 255, 104] (1 color only, 2 colors, 3 colors)
Expected = [67.8, 268.2, 95.0] (X = 15)

No statistical significance (p = .2070, chi2 = 1.7636, df=2). No evidence that there are more tricolor items in the game than the model would account for.

Number of items rolling 0,1,2,... of each color socket
This is the distribution I presented above. Because items with 5 or 6 sockets of a single color are rare, we can't run a chi-square on the full dataset. We could run it instead with 5 possible outcomes for each color. So for red sockets we might have [# w/ 0 red, # w/ 1, # w/ 2, # w/3, # w/4-6]

This still has one outcome whose expected value is below 5, but the chi2 should still do ok since it's only 1 cell. This came out marginally significant (p = .0507, chi2 = 11.4741, df = 8). The effect is driven mostly by the number of items with 3 red sockets (more observed items than predicted). This difference might be driven by the lack of pure STR items in the dataset, which leads to a small expected value for this outcome. It may also be the case that, since 3 red sockets was an unexpected outcome, people were rolling chromes in the hopes of getting that many red sockets and stopped rolling once that outcome appeared. This could lead to biased reporting, although with enough data this bias would end up quite small. I'll be keeping an eye on this particular statistic as the amount of data increases.

If you have other aspects of the data you'd like to see tested, please let me know!


Jan 27th
Spoiler

Update! We currently have 376 datapoints. I was able to add some, and we've also had some wonderful new volunteers as well! Good job everybody, this is way more than I expected after just a week.


Here's what we've got so far. My best estimate for X is 16 (15.92 best explains the data). Assuming the model is accurate, there's a 95% probability that X lies between [13.72, 18.75]. I think we're beginning to get good evidence that X is somewhere around 14-17. More data will pinpoint the exact value a little better, but 15 or 16 is not too bad a guess.

One way to visualize the data is to calculate how often an item came up with N sockets of a given color. You can see the actual distribution here along with a simulated distribution which you'd predict if my model were true: http://imgur.com/a/r1ror

As always, take the following updates with a grain of salt. Data collection continues. It's worth mentioning too that this kind of model fitting is very easy to do, but it rests on the assumption that the model is accurate. There are basically two ways to check that assumption. The first is model comparison: take my model, compare it against another, and see which explains the data better. The second is to use statistical testing to compare my model's theoretical predictions against what the actual data looks like.

Model comparison:
Bayesian model comparison is a very useful way to compare two different models. The biggest problem here is a lack of fleshed-out competing models. One thing to consider is that item mods have weights which differ depending on the item type (e.g. axe, sword, chest, ring, etc.). It's possible this is also true for chromes. If so, we'd need much, much more data to prove it, since instead of estimating 1 parameter we'd potentially need to estimate many dozens. Other possible models might implement the check that prevents rerolling the exact same colors in a different way, perhaps one which biases the item to roll certain color combinations more frequently. With the time that I have, I'm not focusing on model comparison at the moment, but if someone has a proposal I'll definitely test it. Otherwise this won't really happen until I have a better idea of what a different model might look like.

Statistical testing:
So an obvious question is whether or not my model is the true model. This is obviously a difficult question to answer. The best we can do is have the model generate predictions about what the data should look like, and then compare the actual data to those predictions. Since the data takes the form of frequency counts (i.e. how often did a given color combination appear?), we can compare the observed frequencies against the expected frequencies using a chi-square test. One limitation of the chi-square test is that it works best when all expected frequencies are at least 5. This isn't a problem if we want to test how many red sockets we expect in total, but it is a problem if we are trying to test how often we expect 6 red sockets to appear in the data set. Because of this, I can't run a chi-square test on the distribution presented above. Instead, here are some other comparisons I've been looking at:

Total number of red, green, blue sockets
Observed = [217, 787, 776]
Expected = [203.6, 785.5, 790.8] (for X = 16)

As you can see, we observed more reds than expected, but fewer blues. This is not statistically significant (p = .2805, chi2 = 1.1562, df = 2)

Number of items with pure colors, only 2 colors, or all 3 colors
Observed = [56, 224, 96] (1 color only, 2 colors, 3 colors)
Expected = [54.9, 233.1, 88.0] (X = 16)

No statistical significance (p = .2882, chi2 = 1.1016, df=2). No evidence that there are more tricolor items in the game than the model would account for.

Number of items rolling 0,1,2,... of each color socket
This is the distribution I presented above. Because items with 5 or 6 sockets of a single color are rare, we can't run a chi-square on the full dataset. We could run it instead with 5 possible outcomes for each color. So for red sockets we might have [# w/ 0 red, # w/ 1, # w/ 2, # w/3, # w/4-6]

This still has one outcome whose expected value is below 5, but the chi2 should still do ok since it's only 1 cell. This does come out significant (p = .0300, chi2 = 13.4997, df = 8). The effect is driven mostly by the number of items with 3 red sockets (more observed items than predicted). This difference might be driven by the lack of pure STR items in the dataset, which leads to a small expected value for this outcome. It may also be the case that, since 3 red sockets was an unexpected outcome, people were rolling chromes in the hopes of getting that many red sockets and stopped rolling once that outcome appeared. This could lead to biased reporting, although with enough data this bias would end up quite small. I'll be keeping an eye on this particular statistic as the amount of data increases.

If you have other aspects of the data you'd like to see tested, please let me know!


Jan 22nd
Spoiler

Update! We currently have 359 datapoints. I was able to add some, and we've also had some wonderful new volunteers as well! Good job everybody, this is way more than I expected after just a week.

As always, take the following updates with a grain of salt. Data collection continues. It's worth mentioning too that this kind of model fitting is very easy to do, but it rests on the assumption that the model is accurate. Right now I'm basically using visual inspection to make sure the data seems to be fitting well. If someone has another theory about how chromatic orbs work, please let me know! If we can put it into some kind of equation, then we can test which explanation works best.

Here's what we've got so far. My best estimate for X is 16. Assuming the model is accurate, there's a 95% probability that X lies between [13.78, 18.92]. I think we're beginning to get good evidence that X is somewhere around 14-17. More data will pinpoint the exact value a little better, but 15 or 16 is currently not too bad a guess.

You can see a comparison of the actual data distribution versus a simulated distribution here: http://imgur.com/a/geZnP

The first graph is a distribution of socket color rolls from the actual data. Colors are as you would imagine, red=red, green=green etc... Each column indicates how often an item was rolled (in the data) that ended up with N number of (pick your color) sockets. You'll notice that there's a heavy bias in the data for blue and green sockets. Reds appear mostly only once per roll. This is due to the fact that almost all the rolls were for INT, DEX, or INT/DEX items. There are no pure STR items in the corpus at the moment.

The second graph is a simulated distribution of socket rolls based on the items in the data and their previous socket colors. So basically, I removed the "actual" results for each roll, leaving just the item's stats and its current socket colors. For each actual roll, I simulated 1000 rolls. The scale on this graph is the same, although the raw numbers are obviously much higher.
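
A rough sketch of the kind of simulation described above, assuming the model from the first post: sockets are rolled independently with weight Stat + X per color, and the whole item is rerolled if the result matches the old colors exactly. The record format, the function names and the X = 15 default are assumptions for illustration, not the actual script behind the graphs.

```python
import random
from collections import Counter

def roll_chromatic(stats, old_colors, x=15, rng=random):
    """Roll new socket colors; repeat until they differ from old_colors.

    stats: (STR, DEX, INT) requirements of the item.
    old_colors: string such as "BBG", one letter per socket.
    """
    str_req, dex_req, int_req = stats
    weights = [str_req + x, dex_req + x, int_req + x]  # order: R, G, B
    while True:
        new = "".join(rng.choices("RGB", weights=weights, k=len(old_colors)))
        if new != old_colors:
            return new

def simulate_histograms(records, n_sims=1000, x=15, seed=0):
    """For each recorded (stats, old_colors) pair, simulate n_sims rolls.

    Returns {color: Counter mapping 'number of sockets of that color' to
    how many simulated items ended up with that number}.
    """
    rng = random.Random(seed)
    hist = {color: Counter() for color in "RGB"}
    for stats, old_colors in records:
        for _ in range(n_sims):
            new = roll_chromatic(stats, old_colors, x, rng)
            for color in "RGB":
                hist[color][new.count(color)] += 1
    return hist

# Hypothetical log entries: a 4-socket INT item and a 3-socket INT/DEX item
records = [((0, 0, 120), "BBGR"), ((0, 60, 60), "GGB")]
hist = simulate_histograms(records)
for color in "RGB":
    print(color, sorted(hist[color].items()))
```

Dividing each simulated tally by n_sims puts it on the same scale as the observed counts, which makes the two graphs directly comparable.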


Jan 21st
Spoiler
Woot! We have 130 good datapoints so far.

I'm in the process of moving all the files used to GitHub so that anyone can look at them. Let me know if anything looks wrong.

Here's what I can say so far. The data indicates that X = 15. Confidence interval for that is roughly [10,20], so there's still some fidget room on the exact number.

There's been some skepticism about whether or not the distribution of rolls follows such a simple process. Right now, I can't really test whether or not my model is the best description of the data, simply because there is no other model. Suggestions would be awesome, because there are lots of awesome techniques we could use.

But alas, all we really have is visual inspection. So that's what I'm presenting here:

The first graph is a distribution of socket color rolls from the actual data. Colors are as you would imagine, red=red, green=green etc... Each column indicates how often an item was rolled (in the data) that ended up with N number of (pick your color) sockets. You'll notice that there's a heavy bias in the data for blue and green sockets. Reds appear mostly only once per roll. This is due to the fact that almost all the rolls were for pure INT or hybrid INT/DEX items. There are no pure STR items in the corpus.

The second graph is a simulated distribution of socket rolls based on the items in the data and their previous socket colors. So basically, I removed the "actual" results for each roll, leaving just the item's stats and its current socket colors. For each actual roll, I simulated 1000 rolls. The scale on this graph is the same, although the raw numbers are obviously much higher.

I think the correspondence is quite striking, given the small amount of data we currently possess. The simulated data seems to prefer higher numbers of blue sockets. It underestimates the number of items resulting in only 2 blue sockets, while overestimating the number of items with 4. I'm not going to read too much into this right now, but the trend seems to be that actual items have a wider spread of off-colors than the simulated data. But again, n=130.

I'll be editing this particular post as I update the data. If anything really substantial happens, I'll edit the results into the original post as well.


Last edited by lawphill on Apr 1, 2014, 11:13:30 AM
Bump. Added some data to spreadsheet. I hope you collect more data as this is very interesting.
"

2) If the colors on the freshly rolled item match exactly the colors on the item previously, the entire rolling process is repeated, such that rolling a chromatic will never result in the exact same configuration of socket colors (although the total number of each color may remain the same)



So, for this example we have a 200INT item, with let's say one socket to roll. We use a chromatic. The chance of blue is (200 + X) / (200 + 3X), the chance of green is (X) / (200 + 3X), and the chance of red is the same at (X) / (200 + 3X).


That does not add up. If you have a socket, it has a starting color, so if it is blue, the real probabilities are X/2X = 1/2 for each of green and red. If it's not, they are (200+X)/(200+2X) for blue and X/(200+2X) for the other color.

It might be calculated the way you say, then rolled again if it matches the starting color, but then the chance for the color to repeat is zero. So you might want to start there...


I think the way you want to think about it is to separate the probability of a roll from the probability of getting an item with a specific combination of socket colors. We definitely know that not every socket has to be different from what it was before, only that the exact ordering of all the socket colors can't be the same.

I imagine the way things are calculated is that it rolls each socket ignoring what color that particular socket was beforehand. The probabilities for any color are then (Stat+X)/(sum(stats) + 3X).

It's only at the very end that it would check to see if all the rolls came out the same. This check distorts the probabilities in a way which is harder to analytically calculate. I'm using Monte Carlo simulations to approximate the probabilities after this check, but I'm sure there's an analytical solution as well. The only real trouble is how the check happens. I'm assuming a kind of rejection sampling. You just keep rolling until you get something that isn't what you had before. This is the lazy solution, but it has the nice property that the more sockets (and more work involved in rolling a chrome), the less likely you are to get duplicates. Still, it's possible that the check simply involves taking a random socket and modifying it. This would produce a slightly different distribution, and is something that I might try to run a model comparison over at some point to see if it explains the data better.
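
For small socket counts the effect of the check can also be worked out exactly rather than by Monte Carlo. Under the rejection-sampling assumption, every configuration other than the old one keeps its raw probability divided by (1 - P(old configuration)), and the old configuration gets probability zero. The sketch below enumerates all 3^n configurations. The function names and the X = 15 default are assumptions for illustration, and it says nothing about whether the game really does the check this way.

```python
from itertools import product

def color_probs(str_req, dex_req, int_req, x=15):
    """Per-socket chance of each color: (Stat + X) / (sum(stats) + 3X)."""
    total = str_req + dex_req + int_req + 3 * x
    return {
        "R": (str_req + x) / total,
        "G": (dex_req + x) / total,
        "B": (int_req + x) / total,
    }

def exact_outcome_probs(stats, old_colors, x=15):
    """Exact probability of every new configuration after one chromatic.

    Assumes rejection sampling: roll all sockets independently and start
    over if the result equals old_colors.
    Returns {tuple of colors: probability}.
    """
    p = color_probs(*stats, x=x)
    old = tuple(old_colors)
    raw = {}
    for config in product("RGB", repeat=len(old)):
        prob = 1.0
        for color in config:
            prob *= p[color]
        raw[config] = prob
    keep = 1.0 - raw[old]  # chance that a single attempt is accepted
    return {cfg: (0.0 if cfg == old else pr / keep) for cfg, pr in raw.items()}

# The 200 INT, one-socket example from the discussion above
print(exact_outcome_probs((0, 0, 200), "B"))
print(exact_outcome_probs((0, 0, 200), "R"))
```

For the one-socket 200 INT item this gives 1/2 each for red and green when the socket starts blue, and (200+X)/(200+2X) for blue when it starts red. So for a single socket the rejection-sampling model produces the same numbers as the objection raised above; the two views only diverge once there is more than one socket.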
