Special versions of Firefox

Discussion of third-party/unofficial Firefox/Thunderbird/SeaMonkey builds.
Locked
ice-pack
Posts: 3
Joined: December 24th, 2005, 5:57 am

Post by ice-pack »

tete009 wrote:
ice-pack wrote:tete,

Here's my spec:
AMD Duron, 700 MHz (3.5 x 200)
Gigabyte GA-7IXE4 (2 ISA, 5 PCI, 1 AGP, 3 DIMM)<---mobo
AMD-750 Irongate<---mobo chipset

Am I right to download these files?
- G6 + MMX (Pentium 2, Pentium 3, etc)
- tmemutil-20051212-3dnow.zip (Zip archive, 8KB)
- tmsvc-20051212-3dnow-g6.zip (Zip archive, 340KB)

I hear the Duron processors support <q>Enhanced 3DNow!</q> instructions. I recommend the following file:
tmemutil-20051212-3dnow2.zip (Zip archive, 8KB)

I personally think it is worth trying <strong>G7</strong> on Duron and Athlon.
<q>Visual C++ Optimization Overview</q>
http://msdn.microsoft.com/library/en-us ... zation.asp

Thanks tete. I will try that.
mmoy
Posts: 5030
Joined: February 17th, 2004, 9:05 pm
Location: New Hampshire

Post by mmoy »

I think that the best way to get a feel for how your processor performs is to get the Software Optimization Guide (SOG) for your processor and compare it to the SOG for the Pentium 4, as that's Microsoft's target processor for G7. If you're trying to optimize a routine that does a lot of integer division, you can look up the instruction latencies and throughput in the SOG for the processor you're targeting and decide what approach to take based on them.
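As a rough illustration of the integer division case, here is a minimal sketch in C. The function names are made up for this example, and an optimizing compiler will often perform this rewrite on its own when the divisor is a constant; the point is only to show the kind of alternative you would weigh against the DIV latency listed in the SOG.

```c
#include <stdint.h>

/* Straightforward version: depending on compiler and flags, this may
   become a DIV instruction, which has a long latency on most x86 CPUs. */
uint32_t div10_plain(uint32_t x)
{
    return x / 10u;
}

/* Reciprocal version: multiply by ceil(2^35 / 10) = 0xCCCCCCCD and
   shift right by 35. The result is exact for every 32-bit unsigned x. */
uint32_t div10_reciprocal(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

Whether the multiply-and-shift form actually wins depends on the multiply and divide latencies of the processor in question, which is exactly what the SOG tables tell you.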

Optimizing for the Pentium 4 means considering its long pipeline and the heavy penalties for branch mispredictions. It encourages code generation that is fairly long and that pipelines well, with fewer branches. It is somewhat of a shame that, to use the instructions in the later processors, you have to specify -arch:SSE as the minimum. It would be really nice to be able to get the benefits of the CMOVxx instruction without having to carry the SSE/SSE2 floating point baggage with it, as the CMOVxx instruction can be a good performance win on long pipeline machines. I think that the pipeline length is in the low teens on Athlons and in the high 20s or low 30s on the Pentium 4.
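To show what trading a branch for a conditional move looks like at the source level, here is a small sketch in C. The function names are hypothetical, and whether the compiler really emits CMOVcc for the second form depends on the compiler and the flags in use; the third form is branch-free by construction and does not rely on CMOV at all.

```c
#include <stdint.h>

/* Branching version: a mispredicted jump costs a pipeline flush. */
int32_t clamp_branch(int32_t v, int32_t hi)
{
    if (v > hi)
        v = hi;
    return v;
}

/* Select version: the ternary form is a natural candidate for CMOVcc,
   but the compiler makes the final decision. */
int32_t clamp_select(int32_t v, int32_t hi)
{
    return (v > hi) ? hi : v;
}

/* Mask version: uses only compare, negate, AND, and OR. */
int32_t clamp_mask(int32_t v, int32_t hi)
{
    int32_t mask = -(int32_t)(v > hi);   /* all ones if v > hi, else 0 */
    return (hi & mask) | (v & ~mask);
}
```

All three return the same value; the difference is only in how well the generated code behaves when the condition is hard to predict.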

A good example of latency difference is in the Rotate and Shift instructions which are very quick on Athlons and Pentium 3s (1 cycle I think) and 4 cycles on the Pentium 4. There is a limit on OOO parallelism as Rotate and Shift instructions affect the status flags.
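For anyone wondering where rotates come from in C code, which has no rotate operator, here is a minimal sketch. The function name is hypothetical; many compilers recognize this shift-and-OR pattern and emit a single ROL instruction, but that is up to the compiler.

```c
#include <stdint.h>

/* Rotate x left by n bits. n is reduced modulo 32 so that neither
   shift count can reach 32, which would be undefined behavior in C. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}
```

Hash functions and checksums tend to lean on this kind of operation, so the latency difference between processors can show up in real code.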

Dell E521 X2 5600+ MacBookPro 17'' 2.5 Ghz Penryn Dell M1330 2.0 Ghz Merom 4 GB Vista x64 Compaq r3000z AMD 64 3200+ (Win 32/64) PowerMac G5 1.8 Ghz MMOY-1.5 (OSX 10) Inspiron 8500, 4100, 4000, Dimension 2300 MacBook Pro 2.2 Ghz HP E6600 HP X2 4400+
tete009
Posts: 43
Joined: December 11th, 2005, 3:24 pm

Post by tete009 »

mmoy, I really appreciate your helpful advice. I think so too.

I haven't made builds that use the newer instructions such as SSE and CMOVcc, because my website doesn't have enough capacity. If I can solve this problem, I'd like to make builds with the <q>-arch:SSE</q> option.

Thank you very much. :)

Dru
Posts: 184
Joined: May 12th, 2005, 3:44 am
Location: Grimsby, England

Post by Dru »

What sort of capacity do you need? I might be able to help.
iMac G5 1.9Ghz | 1.5Gb | OS X 10.4.8
CreativeFlux
mmoy
Posts: 5030
Joined: February 17th, 2004, 9:05 pm
Location: New Hampshire

Post by mmoy »

I think that automatically using CMOV is beneficial, though it's not clear that SSE is a benefit given the way that MSVC autovectorizes. All MSVC does is convert x87 instructions to scalar SSE instructions. The benefit (so I've read) is that the instructions are more compact. But I think that there is a cost to SSE as well. When you load a scalar floating-point value, you're loading 32 bits, but the SSE instruction has to clear out the additional 96 bits in the rest of the register, and I believe that there is a latency cost to this (it's explicitly stated that there is in the Athlon 64 SOG). MSVC doesn't do things like unroll loops. In many cases, GCC on the Altivec and the Intel C Compiler on x86 do a much better job at autovectorization.
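To make the loop-unrolling point concrete, here is a minimal sketch in plain scalar C, with no SSE intrinsics. The function names are made up for illustration; the second version is the kind of transformation a better vectorizing compiler might do for you, written out by hand.

```c
#include <stddef.h>

/* Simple version: every addition depends on the previous one. */
float sum_simple(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Unrolled by four with independent accumulators, so the additions
   can overlap in the pipeline. A cleanup loop handles the leftovers. */
float sum_unrolled4(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the unrolled version adds the elements in a different order, so for general floating-point input the result can differ from the simple loop in the last bits. That is one reason a compiler may refuse to do this on its own without a relaxed floating-point setting.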

So I've been going the other way getting rid of autovectorized SSE and SSE2. MMX actually has an advantage over SSE and SSE2 in that you don't have to worry about the other half of the register. But MMX has the EMMS cost which is substantial on the Pentium 4 but not so much on AMD processors.

tete009
Posts: 43
Joined: December 11th, 2005, 3:24 pm

Post by tete009 »

mmoy wrote:But I think that there is a cost to SSE as well. When you load in a scalar floating point, you're loading 32-bits but the SSE instruction has to clear out the additional 96 bits in the rest of the register and I believe that there is a latency cost to this (it's explicitly stated that there is in the Athlon 64 SOG).

That rings a bell. To tell the truth, I've built a Firefox with the <q>-arch:SSE</q> option before, but I felt the build became slower on Athlon XP. Therefore, I haven't made my builds with the <q>-arch:SSE</q> option, because I have only an Athlon XP machine.

But it's just as you say: the <q>-arch:SSE</q> and <q>-arch:SSE2</q> options have the advantage of removing certain branch instructions. I imagine this is a good optimization for NetBurst architecture CPUs, which have a deep pipeline.
mmoy
Posts: 5030
Joined: February 17th, 2004, 9:05 pm
Location: New Hampshire

Post by mmoy »

The Intel P4 chips generally have better SSE and SSE2 units compared to the AMD chips. Some of the older Athlon chips did a really bad implementation of SSE and even the current generation uses existing units to take care of SSE and SSE2 instructions in two pieces. Intel has a process advantage and a lot more in the way of research and development dollars to make improvements compared to AMD. It would be nice to be able to get the CMOV stuff without the SIMD scalar stuff.
tete009
Posts: 43
Joined: December 11th, 2005, 3:24 pm

Post by tete009 »

You taught me something! :)
I had rather hoped Microsoft would separate the <q>CMOV</q> option from the <q>-arch:SSE</q> option...
Thanks.
tete009
Posts: 43
Joined: December 11th, 2005, 3:24 pm

Post by tete009 »

Dru wrote:What sort of capacity do you need? I might be able to help.

There is not enough free space on my website at present, but I hope to solve the problem on my own.
I'm grateful for your kindness. :)
mahowi
Posts: 569
Joined: September 16th, 2005, 12:37 pm
Location: Germany

Post by mahowi »



I'm currently using "-G7" for my Athlon XP 1800+. Would it be better to use "-G6" or "-G7 -arch:SSE" instead?
Firefox: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9pre) Gecko/2008052316 Firefox/3.0pre (mahowi) ID:2008052316
Thunderbird: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9pre) Gecko/2008052904 Thunderbird/3.0a2pre ID:2008052904
bcool
Posts: 638
Joined: December 27th, 2003, 9:01 am
Location: Ozarks

Post by bcool »

Tete, your firefox-1.5-2005123014.en-US.win32-tete009-sse-g6 is very nice on my PIII.

It appears that some patches that have landed on 1.5 (1.8.x branch) during this month are not a part of this build. For example https://bugzilla.mozilla.org/show_bug.cgi?id=317855 =or= https://bugzilla.mozilla.org/show_bug.cgi?id=314814 =or= https://bugzilla.mozilla.org/show_bug.cgi?id=316821 just to cite three at random.

Please clarify for me once again, is this code for your sse-g6 12.30.2005 build based on original code from official Mozilla Firefox 1.5 released on November 11, 2005? Would you consider compiling from updated code if it pleases you?

Thank you for your good contribution. :)

---
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051230 Firefox/1.5 (tete009)
M.Silenus
Posts: 29
Joined: December 9th, 2004, 2:32 am

Post by M.Silenus »

Hi tete009,

Do you use mmoy's MMX patches in your builds like other unofficial builders do? If not, give them a try!
bcool
Posts: 638
Joined: December 27th, 2003, 9:01 am
Location: Ozarks

Post by bcool »

I can't speak about all third party builds of course. But with my PIII, I find tete009's mmx-g6 or sse-g6 is every bit as fast (maybe even faster) as stipe's fine builds. Yet tete009 does not include (I believe) mmoy's mmx patches. How does he do it? In tete009's case, will mmoy's patches help or bloat? I wonder. 8-[
Never let them see you sweat
bcool
Posts: 638
Joined: December 27th, 2003, 9:01 am
Location: Ozarks

Post by bcool »

tete, do you have plans for a new build? :)
tete009
Posts: 43
Joined: December 11th, 2005, 3:24 pm

Post by tete009 »

I put new builds on my site. They are Firefox 1.5 and Thunderbird 1.5.
I compiled them by using VC++ 2005 Express and VC++ 2003, and applied mmoy's JPEG and BMP patches.