Could you educate me on how 3D graphics use computer resources?

This might be a good starting point: https://www.khronos.org/opengl/wiki/Rendering_Pipeline_Overview

The 3D graphics work I've done has never really been on the cutting edge, so I don't yet have firsthand experience doing hard optimizations for real-world cases. Someone who has might have more insight, but here's how I understand things now:

- At the highest level, work is split between the CPU and the GPU. A typical modern CPU has maybe 8 to 16 cores, each of which can independently execute its instruction set. That design is well suited to general-purpose, sequential processing where branching is relatively cheap. In contrast, a typical modern GPU has far more cores (hundreds to thousands, depending on how you count them), each much simpler and specialized for the particular tasks it needs to do. GPUs are good at short, repetitive tasks that are easily parallelizable, ideally with minimal branching.

- Most of a program's code runs on the CPU. To draw graphics, some data and instructions need to be sent into the GPU's pipeline to delegate rendering tasks to it. User input, event handling, game logic, etc. share CPU time each frame with the work of feeding the graphics pipeline, though multithreading can parallelize parts of this across more CPU cores. Collision detection, AI, and other core game logic would typically be CPU tasks.

- The amount of CPU time needed to feed the GPU depends on a lot of factors. Immediate mode in early OpenGL involved making multiple GL function calls per vertex to specify its data, to the point where function-call overhead became a performance concern. Vertex arrays (OpenGL 1.1) reduced the number of calls, and Vertex Buffer Objects (VBOs, first an extension and core since OpenGL 1.5) went further: large blocks of packed data are uploaded all at once and stored in GPU memory (typically VRAM), then drawn with a single glDrawElements() call rather than calling glColor(), glVertex(), etc. many times per frame. Note that this means static data can be uploaded once and referenced for the lifetime of the program, rather than being resent every frame.

- Once the GPU has been fed some data to process, it runs that data through the pipeline shown in the article above to produce an image (or some other output, if the GPU is being used for a different kind of computation). Old OpenGL had what was known as the fixed-function pipeline, which worked like a series of configuration options that could be set to determine how incoming data is rasterized. This was quite limited in what it could express, so the modern approach is instead for the CPU to upload programs, known as shaders, that the GPU uses to interpret its data. The most basic setup consists of a vertex program, which runs once per input vertex, and a fragment program, which runs once per output fragment (roughly, once per pixel). These are written in a shading language like GLSL, and the driver compiles them into the GPU's native instruction set. The GPU can hold several of these programs at a time, and the one used for rendering an object is selected with a glUseProgram() call before glDrawElements(). That can be done multiple times per frame, but ideally as few times as possible.

- Textures are stored in separate blocks of data that also take up VRAM in proportion to their resolution. Mipmaps may be involved: a 2048x1024 texture might have downscaled copies at 1024x512, 512x256, etc., all the way down to 1x1. The point of mipmaps is to keep the number of texels read per sample small and fixed; otherwise, drawing a high-resolution texture at a very small size would require examining many of its texels to determine an accurate average color. Bilinear filtering blends the 4 texels nearest the sample point within one mip level. Trilinear filtering additionally picks the two mip levels that best match the screen space the texture covers and blends between them, for 8 texels in total. Textures can also be sampled with nearest-neighbor filtering to get the sharp, pixelated look seen in games like Minecraft, but trilinear filtering is more common for most use cases.
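
To put rough numbers on the VRAM side of mipmapping, here's a small Python sketch (the helper names are just illustrative). It assumes an uncompressed RGBA8 texture at 4 bytes per texel and ignores any padding or compression a real driver might apply. Each level halves both dimensions, rounding down and never going below 1, which matches how OpenGL sizes mip levels.

```python
def mip_chain(width: int, height: int) -> list[tuple[int, int]]:
    """Return the size of every mip level, from full resolution down to 1x1."""
    levels = [(width, height)]
    while width > 1 or height > 1:
        # Each level halves both dimensions (rounding down) but never drops below 1.
        width = max(1, width // 2)
        height = max(1, height // 2)
        levels.append((width, height))
    return levels


def texture_bytes(width: int, height: int, bytes_per_texel: int = 4,
                  mipmapped: bool = True) -> int:
    """Uncompressed size estimate; assumes 4 bytes per texel (RGBA8) by default."""
    levels = mip_chain(width, height) if mipmapped else [(width, height)]
    return sum(w * h * bytes_per_texel for w, h in levels)


if __name__ == "__main__":
    chain = mip_chain(2048, 1024)
    base = texture_bytes(2048, 1024, mipmapped=False)
    full = texture_bytes(2048, 1024)
    print(len(chain), "levels:", chain)
    print(f"{base} bytes without mips, {full} with mips ({full / base - 1:.1%} extra)")
```

For the 2048x1024 example this gives 12 levels, and the full chain costs roughly a third more memory than the base level alone, since each level has a quarter of the texels of the one before it.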

So, now we can finally get to identifying potential bottlenecks. The CPU must not be so busy with its other tasks that it lacks time to send the necessary data over the bus to the GPU. Any data that can be sent once and left static in VRAM saves the cost of uploading it every frame. The size of the data sent matters, so less is better, particularly for dynamic data that changes frequently. Data to be sent mostly consists of vertex attribute buffers (VBOs), texture images, and potentially uniform blocks for shaders to interpret. Shader programs themselves are pretty small and don't need to be changed frequently, so I'd assume they're not much of a concern in a normal case. So, smaller textures, fewer unique vertices, and less frequent changes to data in VRAM are the things to target to save on CPU-to-GPU upload costs.
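
To make the static-versus-dynamic upload point concrete, here's a Python sketch that packs vertices the way an interleaved VBO might be laid out and compares upload totals. The vertex layout and helper names are hypothetical, and the dynamic case assumes the whole buffer is resent every frame, whereas real engines often update only the parts that changed.

```python
import struct

# Hypothetical interleaved vertex layout: position (3 floats), normal (3 floats),
# texture coordinate (2 floats). "<" = little-endian with no padding between fields.
VERTEX_FORMAT = "<3f3f2f"
VERTEX_SIZE = struct.calcsize(VERTEX_FORMAT)  # 8 floats * 4 bytes = 32 bytes per vertex


def pack_vertices(vertices):
    """Pack (position, normal, uv) tuples into one contiguous byte string,
    the kind of single blob that could be handed to glBufferData() in one call."""
    out = bytearray()
    for pos, normal, uv in vertices:
        out += struct.pack(VERTEX_FORMAT, *pos, *normal, *uv)
    return bytes(out)


def upload_bytes(vertex_count, frames, reupload_every_frame):
    """Bytes sent to the GPU for one mesh over `frames` frames, assuming the
    whole buffer is resent each frame in the dynamic case."""
    per_upload = vertex_count * VERTEX_SIZE
    return per_upload * (frames if reupload_every_frame else 1)
```

With this layout, a 10,000-vertex mesh over 600 frames (10 seconds at 60 fps) costs 320,000 bytes if it's uploaded once, versus 192,000,000 bytes if it's resent every frame.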

Over on the GPU side, a vertex program executes once per vertex, and a fragment program executes once per fragment. The cost of each shader program is therefore multiplied by the amount of data it has to process: fewer input vertices means fewer invocations of the vertex program, a shorter vertex program means more vertices can be processed in the same time, fewer output fragments means fewer invocations of the fragment program, and a shorter fragment program means more fragments can be processed in the same time.
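
That multiplication can be written as a back-of-the-envelope cost model. This Python sketch is only illustrative: the per-invocation costs are hypothetical units rather than measured timings, and the fragment term assumes geometry covers the whole screen.

```python
def frame_shading_cost(vertex_count, vertex_cost, width, height, fragment_cost,
                       fragments_per_pixel=1.0):
    """Rough relative shading work for one frame.

    vertex_cost and fragment_cost are hypothetical per-invocation costs in
    arbitrary units, not measured timings. The fragment term assumes geometry
    covers the whole screen; fragments_per_pixel > 1 models overdraw.
    """
    vertex_work = vertex_count * vertex_cost          # one vertex-program run per vertex
    fragments = width * height * fragments_per_pixel  # one fragment-program run per fragment
    fragment_work = fragments * fragment_cost
    return vertex_work + fragment_work
```

For example, dropping from 1920x1080 to 1280x720 cuts fragment-program invocations from 2,073,600 to 921,600, about 44% as many, while the vertex term is unaffected.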

Beyond optimizing models for fewer vertices and choosing the lowest appropriate output resolution, some costs can be avoided by changing how data is organized. Techniques like frustum culling avoid submitting vertex data when the CPU can cheaply determine that it would be outside the rendered view, so the GPU doesn't have to run the vertex program and clipping on those vertices only to discard them. Faraway objects that are fully occluded by nearer objects could be omitted if they would be discarded by the depth buffer anyway, which avoids sending them through the fragment shader. Translucent geometry can be particularly costly, since overlapping fragments each have to be evaluated and blended into the final output: in terms of fragment processing cost, drawing two full-screen translucent quads is like drawing a full screen of opaque geometry twice.
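
Here's a minimal CPU-side sketch of frustum culling with bounding spheres, in Python. It assumes the six frustum planes are already available in world space with unit-length normals pointing into the view volume; in a real engine they're commonly extracted from the view-projection matrix. The hard-coded 90-degree frustum and the object names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Plane:
    """Plane n·p + d = 0 with a unit normal n pointing into the visible volume."""
    nx: float
    ny: float
    nz: float
    d: float

    def signed_distance(self, x: float, y: float, z: float) -> float:
        # Positive inside, negative outside (a true distance because n is unit length).
        return self.nx * x + self.ny * y + self.nz * z + self.d


def sphere_visible(planes, center, radius):
    """Conservative test: False only if the bounding sphere lies entirely
    outside at least one plane; otherwise the object might be visible."""
    x, y, z = center
    return all(p.signed_distance(x, y, z) >= -radius for p in planes)


def cull(objects, planes):
    """Return names of (name, center, radius) objects worth submitting to the GPU."""
    return [name for name, center, radius in objects
            if sphere_visible(planes, center, radius)]


# Hypothetical camera at the origin looking down -z with a 90-degree field of
# view horizontally and vertically, near plane at 0.1 and far plane at 100.
S = 1.0 / math.sqrt(2.0)
FRUSTUM_90 = [
    Plane(S, 0.0, -S, 0.0),       # left:   x >= z
    Plane(-S, 0.0, -S, 0.0),      # right:  x <= -z
    Plane(0.0, S, -S, 0.0),       # bottom: y >= z
    Plane(0.0, -S, -S, 0.0),      # top:    y <= -z
    Plane(0.0, 0.0, -1.0, -0.1),  # near:   z <= -0.1
    Plane(0.0, 0.0, 1.0, 100.0),  # far:    z >= -100
]
```

Given coins at (0, 0, -10), (0, 0, 10), (-50, 0, -10), and (-10.5, 0, -10), `cull` keeps the one straight ahead and the one straddling the left edge, and drops the one behind the camera and the one far off to the side. The test is conservative, so a few invisible objects near corners may still be submitted, which is fine since the GPU will clip them anyway.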

As far as texture size goes, I think it's more of a concern for initial upload and VRAM usage than for rendering performance. From what I've seen, sampling a 2048x2048 texture on the GPU costs about the same as sampling a 1x1 texture, presumably because mipmapping keeps the number of texels read per sample fixed. There might be cases where this isn't true, so definitely don't take my word on that. Trilinear filtering involves a little more work than nearest neighbor, of course. There's also anisotropic filtering, whose workings I don't actually understand, but it makes textures look sharper and cleaner when seen at steep angles.
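
As a simplified software model of the filtering modes, here's a Python sketch that also reports texel reads per lookup. It assumes a single-channel texture stored as nested lists, clamp-to-edge addressing, and a level of detail passed in directly (real hardware derives it from screen-space derivatives). The function names are hypothetical, and anisotropic filtering isn't attempted. The read counts don't depend on the texture's resolution, which fits the observation above, though caching effects aren't modeled at all.

```python
import math


def sample_nearest(tex, u, v):
    """Nearest-neighbor: read the single texel under (u, v). Returns (value, reads)."""
    h, w = len(tex), len(tex[0])
    x = min(w - 1, max(0, int(u * w)))
    y = min(h - 1, max(0, int(v * h)))
    return tex[y][x], 1


def sample_bilinear(tex, u, v):
    """Bilinear: blend the 4 texels surrounding (u, v). Returns (value, reads)."""
    h, w = len(tex), len(tex[0])
    # Texel centers sit at (i + 0.5) / size, so shift by half a texel.
    fx, fy = u * w - 0.5, v * h - 0.5
    x0, y0 = math.floor(fx), math.floor(fy)
    tx, ty = fx - x0, fy - y0

    def texel(x, y):
        # Clamp-to-edge addressing.
        return tex[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    top = texel(x0, y0) * (1 - tx) + texel(x0 + 1, y0) * tx
    bottom = texel(x0, y0 + 1) * (1 - tx) + texel(x0 + 1, y0 + 1) * tx
    return top * (1 - ty) + bottom * ty, 4


def sample_trilinear(mips, u, v, lod):
    """Trilinear: bilinear samples from the two nearest mip levels, blended by
    the fractional level of detail (8 texel reads in this model)."""
    lod = min(max(lod, 0.0), len(mips) - 1)
    lo = int(lod)
    hi = min(lo + 1, len(mips) - 1)
    a, reads_a = sample_bilinear(mips[lo], u, v)
    b, reads_b = sample_bilinear(mips[hi], u, v)
    t = lod - lo
    return a * (1 - t) + b * t, reads_a + reads_b
```

For example, sampling the center of the 2x2 checkerboard `[[0.0, 1.0], [1.0, 0.0]]` with `sample_bilinear` gives `(0.5, 4)`: an even blend of all four texels, where nearest-neighbor would return a single texel's value.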

So, to wrap things up and tie it all back to the coin example, my interpretation is that this is a relatively minor but probably meaningful optimization. The coin model would likely be static data drawn with instancing or something similar. A coin would be an easy culling target, though there might be a lot of them to process in a scene at once. Presumably most things drawn in the world already use normal mapping, so it might be cheaper to use the same technique for coins than to switch to a different shader just for them. I once saw an in-depth breakdown of Breath of the Wild's rendering, and it seemed like everything in the world was drawn with one of around 10 different shaders in active use; keeping that number small would probably be a performance priority. The only extra cost I can clearly see is the VRAM used by the coin-specific normal and roughness map textures. The exception is if coins are the only objects in the world that use their shader, in which case the cost of switching shaders would also need to be weighed against that of a higher-complexity model.
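
One common way to keep shader switches down is to sort each frame's draw list by shader before submitting it, so each shader is bound once. Here's a small Python sketch of that idea; the shader and mesh names are hypothetical, and a real renderer would typically sort by a combined key that also covers textures and other state. Translucent objects are an exception, since they generally need to be drawn back to front and can't be freely reordered.

```python
def count_shader_switches(draws):
    """Count how often the bound shader changes while walking the draw list in
    order. The first bind counts, so the minimum is the number of distinct shaders."""
    switches = 0
    current = None
    for shader, _mesh in draws:
        if shader != current:
            switches += 1  # would correspond to a glUseProgram() call in OpenGL
            current = shader
    return switches


def sort_by_shader(draws):
    """Group draws that share a shader. sorted() is stable, so the original
    order is kept within each shader group."""
    return sorted(draws, key=lambda draw: draw[0])


# Hypothetical frame: (shader name, mesh name) pairs in submission order.
frame = [
    ("terrain", "hill"),
    ("coin", "coin_1"),
    ("terrain", "cliff"),
    ("coin", "coin_2"),
    ("character", "player"),
]
```

Here `count_shader_switches(frame)` is 5, and after `sort_by_shader(frame)` it drops to 3, one per distinct shader. In this model, a coin-only shader would add exactly one more bind per frame, however many coins there are.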