Does the GeForce GTX 970 have a memory allocation bug? (update 3)
For a week or two now in our forums there have been allegations that users of the GeForce GTX 970 have a darn hard time addressing and filling the last 10% of their graphics memory. The 4 GB card seems to run into issues addressing the last 400 to 600 MB of memory, which is significant.
Two weeks ago when I tested this myself to try and replicate it, some games halted at 3.5 GB while others like COD filled the 4 GB completely. These reports have been ongoing for a while now, and initially got dismissed. However, a new small tool helps us to indicate and verify a thing or two, and there really is something going on with that last chunk of memory on the GeForce GTX 970. We have to concur with the findings: there is a problem that the 970 shows, and the 980 doesn't.
Meanwhile, an Nvidia representative here at the Guru3D forums has already stated that "they are looking into it". The tool we are talking about was made by a German programmer going by the name Nai; it is a small program that benchmarks VRAM performance, and on the GTX 970 we can see memory performance drop off at around the 3.3GB mark, while the GTX 980 does not show such behavior:
You can download the test to try it yourself; we placed it here (local Guru3D mirror). This is a customized version based on the original programming by Nai, reworked by a forum member. With this version you can now also specify the allocation block size and the maximum amount of memory that is used, as follows:
vRamBandwidthTest.exe [BlockSizeMB] [MaxAllocationMB]
BlockSizeMB: one of 16, 32, 64, 128, 256, 512 or 1024
MaxAllocationMB: any number greater than or equal to BlockSizeMB
If no arguments are given, the test runs with the 128MB block size by default and no memory limit, which corresponds exactly to the original program. Please disable Aero and preferably disconnect the monitor during the test. We are interested in hearing Nvidia's response to the new findings.
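A couple of example invocations, following the syntax above (the argument values here are just illustrative picks within the listed ranges):
vRamBandwidthTest.exe
    runs with the defaults: 128MB blocks and no allocation limit
vRamBandwidthTest.exe 32 4096
    uses 32MB blocks and allocates at most 4096MB, which steps through the upper region of a 4GB card in finer increments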
You can further discuss your findings in our forums. Please do share your GTX 970 and GTX 980 results with us.
Meanwhile at Nvidia (a chat from a forum user):
[10:11:39 PM] NV Chat: We have our entire team working on this issue with a high priority. This will soon be fixed for sure.
[10:11:54 PM] Me: So, what is the issue?
[10:12:07 PM] Me: What needs to be fixed?
[10:12:46 PM] NV Chat: We are not sure on that. We are still yet to find the cause of this issue.
[10:12:50 PM] NV Chat: Our team is working on it.
Update #1 - Nvidia responds
NVIDIA has now responded to the findings:
The GeForce GTX 970 is equipped with 4GB of dedicated graphics memory. However, the 970 has a different configuration of SMs than the 980, and fewer crossbar resources to the memory system. To optimally manage memory traffic in this configuration, we segment graphics memory into a 3.5GB section and a 0.5GB section. The GPU has higher priority access to the 3.5GB section. When a game needs less than 3.5GB of video memory per draw command then it will only access the first partition, and 3rd-party applications that measure memory usage will report 3.5GB of memory in use on GTX 970, but may report more for GTX 980 if there is more memory used by other commands. When a game requires more than 3.5GB of memory then we use both segments.
We understand there have been some questions about how the GTX 970 will perform when it accesses the 0.5GB memory segment. The best way to test that is to look at game performance. Compare a GTX 980 to a 970 on a game that uses less than 3.5GB. Then turn up the settings so the game needs more than 3.5GB and compare 980 and 970 performance again.
Here's an example of some performance data:
                                                     GeForce GTX 980    GeForce GTX 970
Shadow of Mordor
  <3.5GB setting =
  >3.5GB setting =                                   55fps (-24%)       45fps (-25%)
Battlefield 4
  <3.5GB setting = xMSAA
  >3.5GB setting = 5% res                            19fps (-47%)       15fps (-50%)
Call of Duty: Advanced Warfare
  <3.5GB setting = FSMAA T2x, Supersampling off
  >3.5GB setting = FSMAA T2x, Supersampling on       48fps (-41%)       40fps (-44%)
Shadow of Mordor drops about 24% on the GTX 980 and 25% on the GTX 970, a 1% difference. On Battlefield 4, the drop is 47% on the GTX 980 and 50% on the GTX 970, a 3% difference. On CoD: AW, the drop is 41% on the GTX 980 and 44% on the GTX 970, a 3% difference. As you can see, there is very little change in the performance of the GTX 970 relative to the GTX 980 on these games when it is using the 0.5GB segment.
So removing SMMs to make the GTX 970 a lower-spec product than the GTX 980 is the main issue here; 500MB is 1/8th of the 4GB total memory capacity, just as two SMMs is 1/8th of the total SMM count. So the answer really is: the primary usable memory for the GTX 970 is a 3.5 GB partition.
Nvidia's results seem to suggest this is a non-issue, however actual user results contradict them. I'm not quite certain how well this info will sit with GTX 970 owners, as this isn't a bug that can be fixed; it is designed to function this way due to the cut-down SMMs.
Update #2 - A little bit of testing
On a generic note: I've been using and comparing games with both a 970 and a 980 today, and quite honestly I cannot really reproduce stutters or weird issues other than the normal stuff once you run out of graphics memory. Once you run out of the ~3.5 GB on the GTX 970, or the ~4GB on the GTX 980, slowdowns or weird behavior can occur, but that goes for any graphics card that runs out of video memory. I've seen 4GB graphics memory usage with COD and 3.6 GB with Shadow of Mordor with widely varying settings, and simply cannot reproduce significant enough anomalies. Once you really run out of graphics memory, perhaps flick the AA mode down a notch, from 8x to 4x or something.
I have to state this though: the primary 3.5 GB partition on the GTX 970 with a slow 500MB secondary partition is a big miss from Nvidia, but mostly for not honestly communicating it. I find the problem to be more of a marketing miss with a lot of aftermath due to not mentioning it. Had Nvidia disclosed the information alongside the launch, then you guys would/could have made a more informed decision. For most of you the primary 3.5 GB of graphics memory will be more than plenty at Full HD and up to WQHD resolutions.
Update #3 - The issue that is behind the issue
New info has surfaced: Nvidia messed up quite a bit when they sent out specs to press and media like ourselves. As we now know, the GeForce GTX 970 has 56 ROPs, not 64 as listed in their reviewers' guides. Having fewer ROPs is not a massive thing in itself, but it exposes a thing or two about the memory subsystem and L2 cache. Combined with some new features in the Maxwell architecture, herein we can find the answer to why the card's memory is split up into 3.5GB/0.5GB partitions as noted above.
Look above (and I am truly sorry to make this so complicated, as it really is just that: complicated). You'll notice that on the GTX 970, compared to the 980, there are three disabled SMs, giving the GTX 970 13 active SMs (clusters with things like shader processors). The SMs shown at the top are followed by 256KB L2 caches, which then pair with the 32-bit memory controllers located at the bottom. The crossbar is responsible for communication between the SMs, caches and memory controllers.
You will notice the greyed-out right-hand L2 for this GPU, right? That is a disabled L2 block, and each L2 block is tied to ROPs; the GTX 970 therefore does not have 2,048KB but instead 1,792KB of L2 cache. Disabling ROPs and thus L2 like that is actually new and Maxwell-exclusive; on Kepler, disabling an L2/ROP segment would disable the entire section including a memory controller. So while the L2/ROP unit is disabled, that 8th memory controller to the right is still active and in use.
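A quick back-of-the-envelope check against the numbers in this article: the fully enabled chip has eight 256KB L2 slices, so 8 x 256KB = 2,048KB of L2, each slice paired with a 32-bit memory controller and a block of ROPs. Disable one slice and you are left with 7 x 256KB = 1,792KB of L2 and 64 - 8 = 56 ROPs, which matches the corrected GTX 970 specification exactly.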
Now that we know Maxwell can disable smaller segments and keep the rest active, it follows that the 8th memory controller and its associated DRAM can still be used, but the final 1/8th of the L2 cache is missing/disabled. As you can see, that DRAM controller actually needs to buddy up with the 7th L2 unit, and that is the root cause of a big performance issue. The GeForce GTX 970 has a 256-bit bus over a 4GB framebuffer and the memory controllers are all active and in use, but with the L2 segment tied to the 8th memory controller disabled, interleaving across all eight controllers would drag overall L2 performance down to half of its normal rate.
Nvidia needed to tackle that problem and did so by splitting the total 4GB of memory into a primary 3.5GB partition (196 GB/sec) that makes use of the first seven memory controllers and their associated DRAM, and a secondary 0.5GB partition (28 GB/sec) tied to the 8th memory controller. Nvidia could have, and probably should have, marketed the card as 3.5GB; they probably could even have deactivated an entire right-side quad and gone for a 192-bit memory interface tied to just 3GB of memory, but did not pursue that alternative as the chosen solution offers better performance. Nvidia claims that games hardly suffer from this design/workaround.
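Those two partition bandwidth figures follow directly from the memory configuration (taking the GTX 970's 7 Gbps effective GDDR5 data rate as a given): a single 32-bit controller moves 7 Gbps x 32 bit / 8 = 28 GB/sec, so the seven-controller 3.5GB partition peaks at 7 x 28 = 196 GB/sec, the lone 8th controller behind the 0.5GB partition at 28 GB/sec, and the full 256-bit figure quoted on the spec sheet at 8 x 28 = 224 GB/sec.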
In a rough, simplified explanation: the disabled L2 unit causes a challenge, a performance hit tied to one of the memory controllers. To divert that performance hit the memory is split up into two segments, bypassing the issue at hand; a tweak to get the most out of a lesser situation. Both memory partitions are active and in use: the primary 3.5 GB partition is very fast, the 512MB secondary partition is much slower.
Thing is, the quantifying fact is that nobody really has massive issues; dozens and dozens of media outlets have tested the card with in-depth reviews like the ones here on my site. As for replicating the stutters and such that you see in some of the videos: to date I have not been able to reproduce them unless you do crazy stuff, and I've been on this all weekend. Overall scores are good, and sure, if you run out of memory at some point you will see performance drops. But then you drop from 8x to, say, 4x AA, right?
Nvidia messed up badly here, no doubt about it. The ROP/L2 cache count was goofed up, slipped through the cracks and ended up in their reviewers' guides and spec sheets, and really, they should have called this a 3.5 GB card with an extra layer of L3 cache memory or something. Right now Nvidia is in full damage control; however, I will stick to my recommendations. The GeForce GTX 970 is still a card we like very much in the up-to-WQHD domain, but it probably should have been called a 3.5 GB product with an added 512MB L3 cache.
To answer my own title question: does the GTX 970 have a memory allocation bug? Nope, this was all done by design. Nvidia, however, failed to communicate this properly to the tech media and thus, in the end, to the people that buy the product.
Let us know your thoughts in our forums.
Originally Posted by JohnLai
I kinda agree, but the control group must run the bench in headless mode!
Not true at all.
The point of a control group is simply to maintain a constant condition.
As long as all "questionable" cards are run under the same conditions as the "control group", then the test is valid.
We should run the test under both conditions to see how things shape up though.
I will say that it should be done after a reboot though and not after the system has been run through a plethora of games.
We do want VRAM as clean as possible outside of the testing scenarios presented by Fox, where he's asking for a % of memory to be pre-allocated to see how the "benchmark" fares.
Originally Posted by UZ7
I do notice when I play Unity that it hovers at no more than 3500MB; any change in AA will cause it to use more and the fps drastically drops to a crawl.
Adding AA will reduce framerates anyways.
That's been a constant over the years.
Originally Posted by JohnLai
Oh....I see your point.
But who is going to design a program that can pre-allocate the exact required memory before running the bench?
We don't really have to be "exact" here.
Reasonably close will work.
As long as those testing can stay within an acceptable margin of the pre-allocation limit that Fox is asking for, the results should be viable.
If Fox is asking for 1GB, 1.5+ GB would skew the results but 900MB - 1.1GB would be viable.
Originally Posted by sykozis
Adding AA will reduce framerates anyways.
That's been a constant over the years.
Oh I know that, it's pretty obvious lol.. But if the game runs at 50-60FPS, and when it starts using more than 3500MB of RAM it drops to 5-11FPS, then it's something else?
Granted Unity is not the greatest game to test as the game is buggy in itself.
Nai's Benchmark
Allocating Memory . . .
Chunk Size: 128 MiByte
Allocated 30 Chunks
Allocated 3840 MiByte
Benchmarking DRAM
DRAM-Bandwidth of Chunk no. 0 (0 MiByte to 128 MiByte):190.08 GByte/s
DRAM-Bandwidth of Chunk no. 1 (128 MiByte to 256 MiByte):189.87 GByte/s
DRAM-Bandwidth of Chunk no. 2 (256 MiByte to 384 MiByte):189.90 GByte/s
DRAM-Bandwidth of Chunk no. 3 (384 MiByte to 512 MiByte):190.43 GByte/s
DRAM-Bandwidth of Chunk no. 4 (512 MiByte to 640 MiByte):190.01 GByte/s
DRAM-Bandwidth of Chunk no. 5 (640 MiByte to 768 MiByte):190.37 GByte/s
DRAM-Bandwidth of Chunk no. 6 (768 MiByte to 896 MiByte):190.35 GByte/s
DRAM-Bandwidth of Chunk no. 7 (896 MiByte to 1024 MiByte):190.26 GByte/s
DRAM-Bandwidth of Chunk no. 8 (1024 MiByte to 1152 MiByte):189.91 GByte/s
DRAM-Bandwidth of Chunk no. 9 (1152 MiByte to 1280 MiByte):190.06 GByte/s
DRAM-Bandwidth of Chunk no. 10 (1280 MiByte to 1408 MiByte):190.24 GByte/s
DRAM-Bandwidth of Chunk no. 11 (1408 MiByte to 1536 MiByte):190.30 GByte/s
DRAM-Bandwidth of Chunk no. 12 (1536 MiByte to 1664 MiByte):190.46 GByte/s
DRAM-Bandwidth of Chunk no. 13 (1664 MiByte to 1792 MiByte):190.03 GByte/s
DRAM-Bandwidth of Chunk no. 14 (1792 MiByte to 1920 MiByte):190.00 GByte/s
DRAM-Bandwidth of Chunk no. 15 (1920 MiByte to 2048 MiByte):190.12 GByte/s
DRAM-Bandwidth of Chunk no. 16 (2048 MiByte to 2176 MiByte):190.35 GByte/s
DRAM-Bandwidth of Chunk no. 17 (2176 MiByte to 2304 MiByte):189.85 GByte/s
DRAM-Bandwidth of Chunk no. 18 (2304 MiByte to 2432 MiByte):189.93 GByte/s
DRAM-Bandwidth of Chunk no. 19 (2432 MiByte to 2560 MiByte):190.30 GByte/s
DRAM-Bandwidth of Chunk no. 20 (2560 MiByte to 2688 MiByte):189.69 GByte/s
DRAM-Bandwidth of Chunk no. 21 (2688 MiByte to 2816 MiByte):190.08 GByte/s
DRAM-Bandwidth of Chunk no. 22 (2816 MiByte to 2944 MiByte):190.47 GByte/s
DRAM-Bandwidth of Chunk no. 23 (2944 MiByte to 3072 MiByte):190.26 GByte/s
DRAM-Bandwidth of Chunk no. 24 (3072 MiByte to 3200 MiByte):189.55 GByte/s
DRAM-Bandwidth of Chunk no. 25 (3200 MiByte to 3328 MiByte):58.57 GByte/s
DRAM-Bandwidth of Chunk no. 26 (3328 MiByte to 3456 MiByte):28.13 GByte/s
DRAM-Bandwidth of Chunk no. 27 (3456 MiByte to 3584 MiByte):28.13 GByte/s
DRAM-Bandwidth of Chunk no. 28 (3584 MiByte to 3712 MiByte):28.13 GByte/s
DRAM-Bandwidth of Chunk no. 29 (3712 MiByte to 3840 MiByte):21.82 GByte/s
Benchmarking L2-Cache
L2-Cache-Bandwidth of Chunk no. 0 (0 MiByte to 128 MiByte):487.12 GByte/s
L2-Cache-Bandwidth of Chunk no. 1 (128 MiByte to 256 MiByte):486.94 GByte/s
L2-Cache-Bandwidth of Chunk no. 2 (256 MiByte to 384 MiByte):487.03 GByte/s
L2-Cache-Bandwidth of Chunk no. 3 (384 MiByte to 512 MiByte):487.11 GByte/s
L2-Cache-Bandwidth of Chunk no. 4 (512 MiByte to 640 MiByte):486.95 GByte/s
L2-Cache-Bandwidth of Chunk no. 5 (640 MiByte to 768 MiByte):487.33 GByte/s
L2-Cache-Bandwidth of Chunk no. 6 (768 MiByte to 896 MiByte):487.00 GByte/s
L2-Cache-Bandwidth of Chunk no. 7 (896 MiByte to 1024 MiByte):487.12 GByte/s
L2-Cache-Bandwidth of Chunk no. 8 (1024 MiByte to 1152 MiByte):487.17 GByte/s
L2-Cache-Bandwidth of Chunk no. 9 (1152 MiByte to 1280 MiByte):487.12 GByte/s
L2-Cache-Bandwidth of Chunk no. 10 (1280 MiByte to 1408 MiByte):486.99 GByte/s
L2-Cache-Bandwidth of Chunk no. 11 (1408 MiByte to 1536 MiByte):487.20 GByte/s
L2-Cache-Bandwidth of Chunk no. 12 (1536 MiByte to 1664 MiByte):487.05 GByte/s
L2-Cache-Bandwidth of Chunk no. 13 (1664 MiByte to 1792 MiByte):487.06 GByte/s
L2-Cache-Bandwidth of Chunk no. 14 (1792 MiByte to 1920 MiByte):486.89 GByte/s
L2-Cache-Bandwidth of Chunk no. 15 (1920 MiByte to 2048 MiByte):487.28 GByte/s
L2-Cache-Bandwidth of Chunk no. 16 (2048 MiByte to 2176 MiByte):487.09 GByte/s
L2-Cache-Bandwidth of Chunk no. 17 (2176 MiByte to 2304 MiByte):487.18 GByte/s
L2-Cache-Bandwidth of Chunk no. 18 (2304 MiByte to 2432 MiByte):486.88 GByte/s
L2-Cache-Bandwidth of Chunk no. 19 (2432 MiByte to 2560 MiByte):487.04 GByte/s
L2-Cache-Bandwidth of Chunk no. 20 (2560 MiByte to 2688 MiByte):487.17 GByte/s
L2-Cache-Bandwidth of Chunk no. 21 (2688 MiByte to 2816 MiByte):486.97 GByte/s
L2-Cache-Bandwidth of Chunk no. 22 (2816 MiByte to 2944 MiByte):487.18 GByte/s
L2-Cache-Bandwidth of Chunk no. 23 (2944 MiByte to 3072 MiByte):486.95 GByte/s
L2-Cache-Bandwidth of Chunk no. 24 (3072 MiByte to 3200 MiByte):487.02 GByte/s
L2-Cache-Bandwidth of Chunk no. 25 (3200 MiByte to 3328 MiByte):174.77 GByte/s
L2-Cache-Bandwidth of Chunk no. 26 (3328 MiByte to 3456 MiByte):87.34 GByte/s
L2-Cache-Bandwidth of Chunk no. 27 (3456 MiByte to 3584 MiByte):87.34 GByte/s
L2-Cache-Bandwidth of Chunk no. 28 (3584 MiByte to 3712 MiByte):87.34 GByte/s
L2-Cache-Bandwidth of Chunk no. 29 (3712 MiByte to 3840 MiByte):27.86 GByte/s
Press any key to continue . . .
Windows 7 headless.
Originally Posted by Fox2232
I did miss it. And it is interesting. There are 6 regions which show this bad performance, even while one of the tests had enough free VRAM to accommodate any overhead caused by CUDA errors.
Interestingly enough, in the test where BF was running in the background, some bad regions were allocated before there was any need to do so.
And all bad regions had the same performance in both tests: one better than average, three average, one worse than average and one even worse than that.
At this point I would say that this CUDA test is not defective and truly points towards some issue.
If anyone has an environment for compiling CUDA code, I will modify it to eat exactly 3GB of VRAM.
And then the victim should try to start some game in the remaining (presumably bad) memory region, even a small one which needs like 500MB of VRAM.
Should prove for sure that there is something very bad.
I could compile it and give it some run-time arguments to specify block size; however, I found out that this is not the latest version of the source code.
Where is the version with the additions for readability?
Originally Posted by VultureX
I could compile it and give it some run-time arguments to specify block size; however, I found out that this is not the latest version of the source code.
Where is the version with the additions for readability?
I only know that post #34 has the latest version.
Post #20 has the source code; not sure if that is the latest, it is hidden under a spoiler tag.
Originally Posted by nanogenesis
My 970 put through the test.
(full benchmark output snipped; identical to the results posted above)
So, nanogenesis, do you have a GTX 980 for testing (maybe loan one from a friend? XD), so that we may have a comparison between the 970 and 980?
Originally Posted by JohnLai
Oh....I see your point.
But who is going to design a program that can pre-allocate the exact required memory before running the bench?
If anyone has a CUDA compiler, I can simply modify Nai's code to reserve the desired amount.
I would really like to see 3GB pre-allocated and then, instead of benching, a game running in the remaining part.
(something small that fits in around 500MB)
And btw, the changes Nai made there are mainly to increase the accuracy of the bench (number of cycles); he also changed the benching method a bit in those elements where I was not sure (that stays, I simply do not know the inner CUDA workings).
Originally Posted by UZ7
Oh I know that, it's pretty obvious lol.. But if the game runs at 50-60FPS, and when it starts using more than 3500MB of RAM it drops to 5-11FPS, then it's something else?
Granted Unity is not the greatest game to test as the game is buggy in itself.
Anvil is notorious for AA causing all kinds of issues with frame rates. If I ran MSAA in Black Flag it would load up one core on my CPU and cripple the frame rate, but if I used TXAA I would get a constant 60FPS and a more even load on the CPU.
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <helper_math.h>
#include <stdio.h>
#include <iostream>

#define CacheCount 5

__global__ void BenchMarkDRAMKernel(float4* In, int Float4Count)
{
    int ThreadID = (blockDim.x * blockIdx.x + threadIdx.x) % Float4Count;
    float4 Temp = make_float4(1);
    Temp += In[ThreadID];
    // never-true guard (body restored): the dummy write keeps the load from being optimized away
    if (length(Temp) == -12354)
        In[0] = Temp;
}

__global__ void BenchMarkCacheKernel(float4* In, int Zero, int Float4Count)
{
    int ThreadID = (blockDim.x * blockIdx.x + threadIdx.x) % Float4Count;
    float4 Temp = make_float4(1);
#pragma unroll
    for (int i = 0; i < CacheCount; i++)
        Temp += In[ThreadID + i * Zero];   // Zero is 0, so every iteration re-reads the same cached data
    if (length(Temp) == -12354)
        In[0] = Temp;
}

int main()
{
    static const int PointerCount = 5000;
    int Float4Count = 4 * 1024 * 1024;                // float4 elements per chunk
    int ChunkSize = Float4Count * sizeof(float4);     // chunk size in bytes
    int ChunkSizeMB = (ChunkSize / 1024) / 1024;
    float4* Pointers[PointerCount];
    int UsedPointers = 0;

    printf("Nai's Benchmark \n");
    printf("Allocating Memory . . . \nChunk Size: %i MiByte \n", ChunkSizeMB);

    // grab chunks until allocation fails, i.e. until the VRAM is exhausted
    while (true)
    {
        int Error = cudaMalloc(&Pointers[UsedPointers], ChunkSize);
        if (Error == cudaErrorMemoryAllocation)
            break;
        cudaMemset(Pointers[UsedPointers], 0, ChunkSize);
        UsedPointers++;
    }
    printf("Allocated %i Chunks \n", UsedPointers);
    printf("Allocated %i MiByte \n", ChunkSizeMB * UsedPointers);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int BlockSize = 64;
    int BenchmarkCount = 30;
    int BlockCount = BenchmarkCount * Float4Count / BlockSize;

    printf("Benchmarking DRAM \n");
    for (int i = 0; i < UsedPointers; i++)
    {
        cudaEventRecord(start);
        BenchMarkDRAMKernel<<<BlockCount, BlockSize>>>(Pointers[i], Float4Count);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float milliseconds = 0;
        cudaEventElapsedTime(&milliseconds, start, stop);
        float Bandwidth = ((float)(BenchmarkCount) * (float)(ChunkSize)) / milliseconds / 1000.f / 1000.f;
        printf("DRAM-Bandwidth of Chunk no. %i (%i MiByte to %i MiByte):%5.2f GByte/s \n",
               i, ChunkSizeMB * i, ChunkSizeMB * (i + 1), Bandwidth);
    }

    printf("Benchmarking L2-Cache \n");
    for (int i = 0; i < UsedPointers; i++)
    {
        cudaEventRecord(start);
        BenchMarkCacheKernel<<<BlockCount, BlockSize>>>(Pointers[i], 0, Float4Count);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float milliseconds = 0;
        cudaEventElapsedTime(&milliseconds, start, stop);
        float Bandwidth = (((float)CacheCount * (float)BenchmarkCount * (float)ChunkSize)) / milliseconds / 1000.f / 1000.f;
        printf("L2-Cache-Bandwidth of Chunk no. %i (%i MiByte to %i MiByte):%5.2f GByte/s \n",
               i, ChunkSizeMB * i, ChunkSizeMB * (i + 1), Bandwidth);
    }

    system("pause");
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
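For anyone who wants to build it themselves: the code above compiles with the regular CUDA toolchain. helper_math.h ships with the CUDA Samples, so the include path below is an assumption about where those are installed, and -arch=sm_52 simply targets Maxwell cards like the 970/980:
nvcc -arch=sm_52 -I"<CUDA Samples>/common/inc" -o vRamBandwidthTest vRamBandwidthTest.cu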
This version should allocate exactly 3GB and no more (24 chunks of 128 MiByte = 3,072 MiByte), or stop if there is not enough memory to allocate.
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <helper_math.h>
#include <stdio.h>
#include <iostream>

#define CacheCount 5

__global__ void BenchMarkDRAMKernel(float4* In, int Float4Count)
{
    int ThreadID = (blockDim.x * blockIdx.x + threadIdx.x) % Float4Count;
    float4 Temp = make_float4(1);
    Temp += In[ThreadID];
    // never-true guard (body restored): the dummy write keeps the load from being optimized away
    if (length(Temp) == -12354)
        In[0] = Temp;
}

__global__ void BenchMarkCacheKernel(float4* In, int Zero, int Float4Count)
{
    int ThreadID = (blockDim.x * blockIdx.x + threadIdx.x) % Float4Count;
    float4 Temp = make_float4(1);
#pragma unroll
    for (int i = 0; i < CacheCount; i++)
        Temp += In[ThreadID + i * Zero];
    if (length(Temp) == -12354)
        In[0] = Temp;
}

int main()
{
    static const int PointerCount = 5000;
    int Float4Count = 8 * 1024 * 1024;                // 128 MiByte chunks
    int ChunkSize = Float4Count * sizeof(float4);
    int ChunkSizeMB = (ChunkSize / 1024) / 1024;
    float4* Pointers[PointerCount];
    int UsedPointers = 0;

    printf("Nai's Benchmark \n");
    printf("Allocating Memory . . . \nChunk Size: %i MiByte \n", ChunkSizeMB);

    // stop after 24 chunks: 24 x 128 MiByte = 3,072 MiByte = 3 GB
    while (UsedPointers < 24)
    {
        int Error = cudaMalloc(&Pointers[UsedPointers], ChunkSize);
        if (Error == cudaErrorMemoryAllocation)
            break;
        cudaMemset(Pointers[UsedPointers], 0, ChunkSize);
        UsedPointers++;
    }
    printf("Allocated %i Chunks \n", UsedPointers);
    printf("Allocated %i MiByte \n", ChunkSizeMB * UsedPointers);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int BlockSize = 128;
    int BenchmarkCount = 30;
    int BlockCount = BenchmarkCount * Float4Count / BlockSize;

    printf("Benchmarking DRAM \n");
    for (int i = 0; i < UsedPointers; i++)
    {
        cudaEventRecord(start);
        BenchMarkDRAMKernel<<<BlockCount, BlockSize>>>(Pointers[i], Float4Count);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float milliseconds = 0;
        cudaEventElapsedTime(&milliseconds, start, stop);
        float Bandwidth = ((float)(BenchmarkCount) * (float)(ChunkSize)) / milliseconds / 1000.f / 1000.f;
        printf("DRAM-Bandwidth of Chunk no. %i (%i MiByte to %i MiByte):%5.2f GByte/s \n",
               i, ChunkSizeMB * i, ChunkSizeMB * (i + 1), Bandwidth);
    }

    printf("Benchmarking L2-Cache \n");
    for (int i = 0; i < UsedPointers; i++)
    {
        cudaEventRecord(start);
        BenchMarkCacheKernel<<<BlockCount, BlockSize>>>(Pointers[i], 0, Float4Count);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float milliseconds = 0;
        cudaEventElapsedTime(&milliseconds, start, stop);
        float Bandwidth = (((float)CacheCount * (float)BenchmarkCount * (float)ChunkSize)) / milliseconds / 1000.f / 1000.f;
        printf("L2-Cache-Bandwidth of Chunk no. %i (%i MiByte to %i MiByte):%5.2f GByte/s \n",
               i, ChunkSizeMB * i, ChunkSizeMB * (i + 1), Bandwidth);
    }

    system("pause");
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}
Try to run the 3GB version and check whether 3GB is actually taken; if so, launch some smaller game. If not, run one bench and check whether the memory stays allocated after the bench, since I am not sure if the regions get freed.
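If you want to double-check how much VRAM is actually tied up while the 3GB build is holding its chunks, a small stand-alone snippet along these lines would do it. This is just a sketch, not part of Nai's or Fox2232's code; it relies on cudaMemGetInfo, which reports device-wide free and total memory, so it sees allocations made by other processes too:

#include <cuda_runtime.h>
#include <stdio.h>

// Prints how much device memory is currently free versus the total on the card.
// Run it while the 3GB version of the benchmark is still holding its chunks
// to confirm that roughly 3GB really is tied up.
int main()
{
    size_t FreeBytes = 0, TotalBytes = 0;
    if (cudaMemGetInfo(&FreeBytes, &TotalBytes) != cudaSuccess)
    {
        printf("cudaMemGetInfo failed \n");
        return 1;
    }
    printf("Free VRAM : %llu MiByte \n", (unsigned long long)(FreeBytes / (1024 * 1024)));
    printf("Total VRAM: %llu MiByte \n", (unsigned long long)(TotalBytes / (1024 * 1024)));
    return 0;
}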
Nai's Benchmark
Allocating Memory . . .
Chunk Size: 128 MiByte
Allocated 30 Chunks
Allocated 3840 MiByte
Benchmarking DRAM
DRAM-Bandwidth of Chunk no. 0 (0 MiByte to 128 MiByte):148.57 GByte/s
DRAM-Bandwidth of Chunk no. 23 (2944 MiByte to 3072 MiByte):150.48 GByte/s
DRAM-Bandwidth of Chunk no. 24 (3072 MiByte to 3200 MiByte):33.52 GByte/s
DRAM-Bandwidth of Chunk no. 25 (3200 MiByte to 3328 MiByte):22.35 GByte/s
DRAM-Bandwidth of Chunk no. 26 (3328 MiByte to 3456 MiByte):22.35 GByte/s
DRAM-Bandwidth of Chunk no. 27 (3456 MiByte to 3584 MiByte):22.35 GByte/s
DRAM-Bandwidth of Chunk no. 28 (3584 MiByte to 3712 MiByte): 7.89 GByte/s
DRAM-Bandwidth of Chunk no. 29 (3712 MiByte to 3840 MiByte): 8.44 GByte/s
Benchmarking L2-Cache
L2-Cache-Bandwidth of Chunk no. 0 (0 MiByte to 128 MiByte):418.70 GByte/s
L2-Cache-Bandwidth of Chunk no. 23 (2944 MiByte to 3072 MiByte):418.72 GByte/s
L2-Cache-Bandwidth of Chunk no. 24 (3072 MiByte to 3200 MiByte):111.02 GByte/s
L2-Cache-Bandwidth of Chunk no. 25 (3200 MiByte to 3328 MiByte):75.46 GByte/s
L2-Cache-Bandwidth of Chunk no. 26 (3328 MiByte to 3456 MiByte):75.46 GByte/s
L2-Cache-Bandwidth of Chunk no. 27 (3456 MiByte to 3584 MiByte):75.46 GByte/s
L2-Cache-Bandwidth of Chunk no. 28 (3584 MiByte to 3712 MiByte): 1.#J GByte/s
L2-Cache-Bandwidth of Chunk no. 29 (3712 MiByte to 3840 MiByte): 1.#J GByte/s
Tried 344.16 and 347.09. 347.09 never completed for me and 344.16 crashed four out of five times. Is this "normal" as well?