tete009 wrote:
ice-pack wrote: tete,
Here's my spec:
- AMD Duron, 700 MHz (3.5 x 200)
- Gigabyte GA-7IXE4 (2 ISA, 5 PCI, 1 AGP, 3 DIMM) <--- mobo
- AMD-750 Irongate <--- mobo chipset
Am I right to download these files?
- G6 + MMX (Pentium 2, Pentium 3, etc)
- tmemutil-20051212-3dnow.zip (Zip archive, 8KB)
- tmsvc-20051212-3dnow-g6.zip (Zip archive, 340KB)
I hear the Duron processors support "Enhanced 3DNow!" instructions. I recommend the following file: tmemutil-20051212-3dnow2.zip (Zip archive, 8KB)
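For anyone unsure which of the two utility archives fits their processor, here is a minimal sketch in C of how the choice could be checked. It assumes the caller has already read the EDX register returned by CPUID extended leaf 0x80000001 on an AMD processor (how that value is obtained, for example with a compiler intrinsic, is left out). The macro and function names are hypothetical and are not part of any build in this thread.

```c
#include <stdint.h>

/* Feature bits in EDX of CPUID extended leaf 0x80000001 on AMD processors. */
#define AMD_EDX_3DNOW     (1u << 31)  /* 3DNow! */
#define AMD_EDX_3DNOW_EXT (1u << 30)  /* extensions to 3DNow! ("Enhanced 3DNow!") */

/* Hypothetical helper: returns 2 when Enhanced 3DNow! is reported,
   1 when only plain 3DNow! is reported, and 0 when neither is. */
static int threednow_level(uint32_t ext_edx)
{
    if ((ext_edx & AMD_EDX_3DNOW) == 0)
        return 0;
    return (ext_edx & AMD_EDX_3DNOW_EXT) ? 2 : 1;
}
```

Under that assumption, a result of 2 would point to the 3dnow2 archive and a result of 1 to the 3dnow archive.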
mmoy, I really appreciate your helpful advice. I think so too.
I haven't made builds that use the newer instructions such as SSE and CMOVcc, because my website doesn't have enough capacity. If I can solve this problem, I'd like to make builds with the "-arch:SSE" option.
Thank you very much.
mmoy wrote: I think that the best way to get a feel for how your processor performs is to get the Software Optimization Guide (SOG) for your processor and compare it to the SOG for the Pentium 4, as that's Microsoft's target processor for G7. If you're trying to optimize a routine that does a lot of integer division, you can get the instruction latencies and throughput from the SOG for the processor that you're targeting and decide what approach to take based on them.
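As an illustration of the kind of approach the division latencies might push you toward, here is a minimal sketch in C that replaces an unsigned division by the constant 10 with a multiply and a shift. The function name is hypothetical, and whether this is actually a win depends on the latencies listed in the SOG for the processor in question.

```c
#include <stdint.h>

/* Divide a 32-bit unsigned value by 10 without a DIV instruction.
   0xCCCCCCCD is ceil(2^35 / 10), so (x * 0xCCCCCCCD) >> 35 equals x / 10
   for every 32-bit unsigned x. The product needs 64 bits. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

Optimizing compilers usually make this transformation themselves when the divisor is a compile-time constant, so writing it by hand mainly matters when reading or tuning generated code.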
Optimizing for the Pentium 4 means considering its long pipeline and the heavy penalties for branch mispredictions. It encourages code generation that is fairly long and that pipelines well, with fewer branches. It is somewhat of a shame that, to use the instructions in the later processors, you have to specify -arch:SSE as the minimum. It would be really nice to be able to get the benefits of the CMOVxx instruction without having to carry the SSE/SSE2 floating point baggage with it, as CMOVxx can be a good performance win on long pipeline machines. I think that the pipeline length is in the low teens on Athlons and in the high 20s or low 30s on the Pentium 4.
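To make the branch versus CMOVxx point concrete, here is a small sketch in C of the same selection written two ways. The function names are hypothetical. Whether the compiler emits a conditional jump, a CMOVxx, or something else for either version depends on the compiler and the options used, so treat this as an illustration of the idea rather than a statement about what MSVC generates.

```c
#include <stdint.h>

/* Branchy version: without CMOVxx available, this is typically
   compiled to a compare followed by a conditional jump. */
static int32_t max_branch(int32_t a, int32_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Branch-free version: (a > b) is 1 or 0, so the negation gives a mask
   of all ones or all zeros, which then selects a or b with no jump. */
static int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);
    return (a & mask) | (b & ~mask);
}
```

On a long pipeline, the second form trades a few extra ALU operations for the absence of a mispredictable branch, which is the same trade a CMOVxx makes in a single instruction.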
A good example of latency difference is in the Rotate and Shift instructions, which are very quick on Athlons and Pentium 3s (1 cycle, I think) and take 4 cycles on the Pentium 4. There is also a limit on out-of-order (OOO) parallelism, as Rotate and Shift instructions affect the status flags.
I think that automatically using CMOV is beneficial, though it's not clear that SSE is a benefit given the way that MSVC autovectorizes. All MSVC does is convert x87 instructions to scalar SSE instructions. The benefit (so I've read) is that the instructions are more compact. But I think that there is a cost to SSE as well. When you load a scalar floating point value, you're loading 32 bits, but the SSE instruction has to clear out the additional 96 bits in the rest of the register, and I believe that there is a latency cost to this (it's explicitly stated that there is in the Athlon 64 SOG). MSVC doesn't do things like unroll loops. In many cases, GCC on the Altivec and the Intel C Compiler on x86 do a much better job at autovectorization.
So I've been going the other way, getting rid of autovectorized SSE and SSE2. MMX actually has an advantage over SSE and SSE2 in that you don't have to worry about the other half of the register. But MMX has the EMMS cost, which is substantial on the Pentium 4 but not so much on AMD processors.
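For readers who haven't seen it, the loop unrolling mentioned above can be done by hand. This is a minimal sketch in plain C, with hypothetical function names, of a float sum written straightforwardly and then unrolled four times with independent accumulators so that consecutive additions don't all wait on one another. It is not taken from any of the patches discussed in this thread.

```c
#include <stddef.h>

/* Straightforward loop: each addition depends on the previous one. */
static float sum_simple(const float *v, size_t n)
{
    float s = 0.0f;
    size_t i;
    for (i = 0; i < n; ++i)
        s += v[i];
    return s;
}

/* Unrolled by four with separate accumulators, plus a tail loop
   for the leftover elements when n is not a multiple of four. */
static float sum_unrolled4(const float *v, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i)
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```

Because floating point addition is not associative, the two versions can differ in the last bits for general input, which is one reason compilers are cautious about doing this transformation on their own.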
mmoy wrote: But I think that there is a cost to SSE as well. When you load a scalar floating point value, you're loading 32 bits, but the SSE instruction has to clear out the additional 96 bits in the rest of the register, and I believe that there is a latency cost to this (it's explicitly stated that there is in the Athlon 64 SOG).
That rings a bell. To tell the truth, I've built a Firefox with the "-arch:SSE" option before, but I felt the build became slower on Athlon XP. Therefore, I haven't made my builds with the "-arch:SSE" option, because I have only an Athlon XP machine.
But it's just as you say: the "-arch:SSE" and "-arch:SSE2" options have the advantage of removing certain branch instructions. I imagine this is a good optimization for NetBurst architecture CPUs, which have deep pipelines.
The Intel P4 chips generally have better SSE and SSE2 units than the AMD chips. Some of the older Athlon chips had a really bad implementation of SSE, and even the current generation uses existing units to handle SSE and SSE2 instructions in two pieces. Intel has a process advantage and a lot more research and development dollars to make improvements with than AMD does. It would be nice to be able to get the CMOV stuff without the SIMD scalar stuff.
You taught me something!
I'd rather hoped Microsoft would separate the "CMOV" option from the "-arch:SSE" option...
Thanks.
I'm currently using "-G7" for my Athlon XP 1800+. Would it be better to use "-G6" or "-G7 -arch:SSE" instead?
Firefox: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9pre) Gecko/2008052316 Firefox/3.0pre (mahowi) ID:2008052316 Thunderbird: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9pre) Gecko/2008052904 Thunderbird/3.0a2pre ID:2008052904
Please clarify for me once again: is the code for your sse-g6 12.30.2005 build based on the original code of the official Mozilla Firefox 1.5 released on November 11, 2005? Would you consider compiling from updated code, if it pleases you?
Thank you for your good contribution.
---
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051230 Firefox/1.5 (tete009)
I can't speak about all third-party builds, of course. But with my PIII, I find tete009's mmx-g6 or sse-g6 is every bit as fast as (maybe even faster than) stipe's fine builds. Yet tete009 does not include (I believe) mmoy's mmx patches. How does he do it? In tete009's case, will mmoy's patches help or bloat? I wonder.
I put new builds on my site. They are Firefox 1.5 and Thunderbird 1.5.
I compiled them by using VC++ 2005 Express and VC++ 2003, and applied mmoy's JPEG and BMP patches.
mozillaZine is an independent Mozilla community and advocacy site. We're not affiliated or endorsed by the Mozilla Corporation but we love them just the same.