Transparent Page Sharing – Reduce Your Memory Footprint (Homelab recommended)

In late 2014, VMware changed its default stance on using TPS to share memory between virtual machines as a consequence of a low-risk security threat. Needing to remain squeaky clean on security compliance, VMware changed the out-of-the-box behaviour of TPS for good, which for some customers meant that memory budgets and allocations architected around the old implementation had to be reconsidered. Whether or not you should revert the default behaviour and regain what is an extremely efficient way of sharing memory is a discussion to be had with corporate security officers, but for homelabs, where YOU wear the security hat (amongst many others), why not benefit from reclaiming some much-needed RAM?

The reasons for and against doing this can be found on numerous other blog posts, but as a quick and dirty guide, here are the steps you need to follow to restore the original mechanism:

On each host within your cluster, through the Web Client, click on Manage –> Settings –> Advanced System Settings. Locate Mem.ShareForceSalting and change the default value of 2 to 0 (zero)
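If you prefer the command line, the same change can be made from an SSH session on each host with esxcli (the second command simply confirms the new value):

# esxcli system settings advanced set -o /Mem/ShareForceSalting -i 0
# esxcli system settings advanced list -o /Mem/ShareForceSalting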

vMotion the workloads off or power them off and on again for the change to take effect!

To demonstrate the benefits of reverting TPS to the legacy behaviour, here are the before and after memory stats from my 3-node vSAN cluster with 48GB of RAM:

[Screenshots: esxtop memory stats (PSHARE/MB) for HOST1, HOST2 and HOST3, before and after the change]

As you can see from the PSHARE/MB figures (the common and saving columns), with a little bit of maths the total across all three hosts comes to a memory reclamation of over 10GB, which is just short of a quarter of the RAM available to my entire cluster. Why not try it and see what effect it has, at least on your homelab!

Nostalgia for Nostalgia – Prince of Persia OVF still working within vSphere 6

Many years ago, I used to demo the capabilities of VMware by using the freely accessible Nostalgia OVF from the VMware marketplace (I think it was available through vCenter 2.5 at the time). It was such a small and lightweight appliance, containing a simple set of well-known games, that it made demonstrating the power of a relatively new production-ready technology (it was 2006) all the easier. I remember sitting in various meetings with clients and decision makers, talking about and showing vMotion, Fault Tolerance and HA whilst playing Prince of Persia. I also remember using CPU Hog to force DRS activity as the icing on the cake, combining vMotion and intelligent resource placement. It was such a simple but effective way of getting the message across about what could be done and how VMware was going to be a game changer in server deployment, cost reduction and resource optimization.

Earlier this week, I had a nostalgic moment, wondering if I could still do the same thing today that I did all those years ago – re-running some tests but leveraging a number of other product features available in the VMware portfolio (SRM, vSAN stretched cluster etc.).

I set out to find the Nostalgia OVF but, despite a search through the Virtual Appliance Marketplace (via Solution Exchange), I didn’t have any luck.

I then stumbled across an old VMware community post here that pointed me in the right direction of the OVF:

http://download3.vmware.com/software/appliances/Nostalgia.ovf

After running through the typical OVF deployment process and entering the above URL, the VM appeared within vSphere 6, residing on my vSAN datastore and waiting to be powered on. The results can be seen below:-
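For what it’s worth, ovftool should also be able to deploy straight from that URL – a rough sketch, where the vi:// target (user, vCenter and inventory path) is entirely specific to your own environment:

ovftool http://download3.vmware.com/software/appliances/Nostalgia.ovf vi://user@vcenter/Datacenter/host/Cluster/

In my case I just stuck with the Web Client deployment wizard.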

[Screenshots: the Nostalgia VM running Prince of Persia within vSphere 6]

Not quite sure when my next post will be – let’s see how long it takes me to relive some of my childhood gaming memories ;o)


Handling of problematic disks in vSAN 6.1 – HomeLab warning

Just a quick note of caution for any other home lab users who are considering using vSAN 6.1. As part of the prep work for building the environment, it is important to know that if you are using consumer-grade disks and/or bypassing some of the other HCL requirements, sustained periods of high latency (which can be expected depending on how hard you push your kit) can trigger the device monitoring and unmounting process and take your disk group offline – so you should disable it. Whilst I initially thought this was the silver bullet to the problems I’ve been experiencing, in my scenario it’s only the consumer-grade SSD that disappears, not the entire disk group containing both the Samsung (consumer) and Intel (enterprise) SSDs.

I’ve copied the key commands below directly from Cormac’s blog, but I have applied *BOTH* settings in my environment.

  • Disable VSAN Device Monitoring (and subsequent unmounting of diskgroup):
    # esxcli system settings advanced set -o /LSOM/VSANDeviceMonitoring -i 0    <-- default is "1"
  • Disable VSAN Slow Device Unmounting (continues monitoring):
    # esxcli system settings advanced set -o /LSOM/lsomSlowDeviceUnmount -i 0   <-- default is "1"
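To double-check that the values have stuck (and to see what they were set to beforehand), you can list each option:

    # esxcli system settings advanced list -o /LSOM/VSANDeviceMonitoring
    # esxcli system settings advanced list -o /LSOM/lsomSlowDeviceUnmount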

The official VMware article on this can be found in KB2132079.

Cormac Hogan’s blog article can be found here.

The homelab rebuild.. vSAN Progress and initial VMs..

Further to my previous post about rebuilding my home lab with the Intel SSDs as the caching tier for an all-flash vSAN, unfortunately within a day one of the ESXi hosts fell over with the usual Permanent Disk Loss error and I had a sad face. I rebooted the host, re-applied the storage policy to bring everything back into compliance, and thought I’d give it one last chance before reverting to the magnetic disks. Since then (3 days and counting), the environment has stayed up and online, and in fact I have pushed it harder than ever before by running multiple clones (at least 3 at a time) to properly kick the tyres – the risk being that I build lots of VMs only to have to svMotion them over to my external array, which is time consuming.

On average, a 40GB Windows 2012 virtual machine is taking no more than 7 minutes to clone, and as I’ve only got 1Gb connectivity between hosts for the vSAN cluster, the network is actually the bottleneck here at 125MB/s (and that assumes it is running flat out with no overhead or retransmission issues):

1Gb/s = 125MB/s
125MB/s x 60 = 7,500MB per minute
40GB (40,960MB) / 7,500MB per minute = roughly 5.5 minutes

A quick breakdown of the VM build so far:

2 x Win2k12 Domain Controllers, running DNS and acting as a CA
1 x VCSA
1 x SQL 2014 VM – hosting the View Composer DB
2 x View Connection Servers
1 x View Composer for Horizon View
1 x App Volumes Server

I’ve been particularly light on the customisation side, but I have green lights where green lights need to exist on the solutions I’ve built thus far. The most time-consuming part was the certificate work, involving the replacement of the machine cert on the VCSA alongside working out how to reissue the certs for the View Connection and Composer servers after I’d already performed the installs. From experience I’ve always had fun with certificates in Horizon View deployments, but this time round wasn’t as painful as I knew most of the pitfalls and gotchas. For those that administer Horizon View, this is a joy to see post installation:-

[Screenshot: Horizon View dashboard with all system health indicators green]

I used some of the following blog posts/links as reference for redeploying certificates:-

https://blogs.vmware.com/vsphere/2015/06/creating-a-microsoft-certificate-authority-template-for-ssl-certificate-creation-in-vsphere-6-0.html

https://blogs.vmware.com/vsphere/2015/07/custom-certificate-on-the-outside-vmware-ca-vmca-on-the-inside-replacing-vcenter-6-0s-ssl-certificate.html

https://pubs.vmware.com/horizon-62-view/index.jsp#com.vmware.horizon-view.certificates.doc/GUID-DC255880-8AB2-45BF-93D9-14942DBE13AB.html
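If memory serves, the machine certificate replacement on the VCSA itself is driven through the built-in certificate-manager utility, run over SSH on the appliance (one of its menu options replaces the machine SSL certificate with a custom one):

/usr/lib/vmware-vmca/bin/certificate-manager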

Unable to connect to the MKS: Console access to the virtual machine cannot be granted since the connection limit has been reached

Today I was in the process of managing my VMs and as I use a Mac with the VMware Remote Console, it can sometimes be a little flakey in terms of stability. This isn’t normally a big deal for me as typically I’ll reopen the MKS session and pick up from where I left off. For some reason, today was a little different and after the usual “crash”, I attempted to connect back across and was presented with “Unable to connect to the MKS: Console access to the virtual machine cannot be granted since the connection limit of 1 has been reached”.


I then tried the integrated console but received a similar message of “You have reached the maximum number of connected consoles: 1. Please contact your administrator.”


I knew that restarting the VM would clear the issue, or that powering it down and increasing the number of connections permitted (KB2015407) would be a workaround, but the problem was I didn’t know what state the VM was in – I was mid-installation, so I couldn’t really justify pulling the plug.
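For reference, the setting the KB refers to (if I remember rightly) is an advanced configuration parameter added to the VM whilst it is powered off – for example, to allow two simultaneous console connections:

RemoteDisplay.maxConnections = "2"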

At that point I thought I’d try a quick vMotion between the hosts and as if by magic, my subsequent attempt to connect to the console in either way sprung back into life!

VCP 6 Qualified


Today I sat 2V0-621D and upon completion of the test, it advised me that I had passed successfully (woohoo). I now plan to continue on my learning path, juggling, as many people do, a hectic balance of personal life, study and work. I see 2016 as a big year for vSphere 6, with many customers making the leap from the traditional Windows vCenter to the VCSA. For me, taking the exam was an important milestone: bolstering my hands-on experience with a properly certified status and reaffirming my commitment to the technology, as I have done since my first VMware exam back in 2008.

VMUG Advantage

To complement my “being more involved” hat, I’ve subscribed to the VMUG Advantage program to take “advantage” of the benefits it can bring as well as to get more opportunities to engage with others in the community. Some of the most obvious reasons for me to sign up included:-

  • Discounts on Examinations
  • Discounts on Training
  • $600 worth of service credit for vCloud Air
  • Recognition
  • Access to EvalExperience, which provides licensing for the core VMware stack to keep the homelab ticking along

I’m hoping it will be a worthwhile investment so that 2016 can push my technical capabilities that extra mile.

Unresponsive guest – hung VM

I’ve encountered a few scenarios where virtual machines have just refused to power off, and each time I find myself hunting down the best method to kill them off for good. Occasionally these “hung” virtual machines are the result of losing sight of their storage – yet the memory thread still stays resident.

Firstly, it’s best to determine if the VM really is still running:-

vmware-cmd -l
(this lists the Virtual Machines on the host – on and off)

Copy the full path to the VM that you wish to query, e.g. /vmfs/volumes/4a69985-29b83f0c-5ee5-001b3432f0d0/vm.vmx

and insert it into

vmware-cmd (path) getstate

i.e. vmware-cmd /vmfs/volumes/4a69985-29b83f0c-5ee5-001b3432f0d0/vm.vmx getstate

if the host believes the virtual machine is still on, it will return
getstate() = on

if the machine is in fact off, it will return
getstate() = off

If it is still running and you are unable to shut it down using the vSphere/VI client, here are a couple of ways to kill off any unresponsive virtual machines:-

vmware-cmd (path) stop

validate whether this has been successful with another getstate command

vmware-cmd (path) getstate

if unsuccessful, try a stop hard request

vmware-cmd (path) stop hard

once again, checking to see if this has worked

vmware-cmd (path) getstate

—————————-

Alternatively, you could try:

vm-support -x
(this displays a list of running VMs and their associated World IDs)

vm-support -X <wid>
(this attempts to kill off the process with the World ID specified)

—————————-

Finally, and as a last resort:-

ps -g | grep <VMname>

This will show output similar to the following:

649451      vmm0:VMname
649453      vmm1:VMname
649640 649448 mks:VMname       649448 649448  /bin/vmx
649641 649448 vcpu-0:VMname    649448 649448  /bin/vmx
649642 649448 vcpu-1:VMname    649448 649448  /bin/vmx

The first column is the World ID (WID), the second column is CID and the fourth column is the Process Group ID (PGID). The PGID is the relevant value required (649448).

kill -9 <PGID>
i.e. kill -9 649448

Using the kill command, the unique processes for this VM should now be terminated. I have found that whilst this works, it does sometimes reset the VM.
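On more recent ESXi releases where vmware-cmd isn’t available in the shell, esxcli provides an equivalent way of listing and killing a stuck VM’s world – soft first, then force as a last resort:

esxcli vm process list
esxcli vm process kill --type=soft --world-id=<wid>
esxcli vm process kill --type=force --world-id=<wid>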

VM-flat.vmdk file – check.. but where’s the vmdk file?

So after spending many hours migrating all my VMs back to their uber performing datastores, I went to power on my secondary DC only to find it would not start up.

Something to do with a missing file.

“The system cannot find the file specified.
Cannot open the disk ‘DC02.vmdk’ or one of the snapshot disks it depends on.
VMware ESX cannot find the virtual disk “DC02.vmdk”. Verify the path is valid and try again. ”

I immediately checked the configuration settings in vCenter and all appeared correct. The datastore browser confirmed that it could see the 10GB vmdk file – so what could it be?

I never trust a GUI, so I ssh’d over to the TSM and did a quick directory listing, only to find that whilst the -flat.vmdk file was there, the .vmdk descriptor file wasn’t! Somewhere in the migration back, the VM had lost the small file that holds its understanding of disk geometry, controller type and provisioned format (thin/thick).

Knowing I had the -flat file was reassuring; had the shoe been on the other foot and all I was left with was the descriptor vmdk file, I would have been a lot more concerned.

The first step to resolution was to create a new virtual disk identical in size to the -flat file I had been left with. In turn, this would then create a new VMDK descriptor file that I could borrow.

1) Determine the existing -flat.vmdk file size

ls -l *-flat.vmdk
-rw——-    1 root     root         4841537536 Feb 24 23:38 DC02-flat.vmdk

2) Determine the controller type associated with this disk

less *.vmx | grep -i virtualdev

scsi0.virtualDev = “lsilogic”

In this instance the VM used the lsilogic controller

3) Armed with this information, I now create a new vmdk

vmkfstools -c 4841537536 -a lsilogic -d thin temp.vmdk
(the -d thin parameter provisions the disk as thin, since we don’t really want the new -flat file anyway)

4) The result is a temp-flat.vmdk and a temp.vmdk file.

5) Rename the temp.vmdk file to match the VM name – in this case DC02

mv temp.vmdk DC02.vmdk

6) Edit DC02.vmdk using vi and update the extent description so that it references the existing flat file, i.e. change the temp reference to DC02-flat.vmdk:

# Extent description
RW 20971520 VMFS “DC02-flat.vmdk”

7) If the original -flat.vmdk was thinly provisioned, you do not need to modify any additional parameters in the file; however, if it was thick, you must remove the following line:-

ddb.thinProvisioned = “1”

8) Delete the temp-flat.vmdk created in step 3 and you should be good to go!
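Before powering the VM back on, it’s worth a quick sanity check that the new descriptor and the original -flat file line up – vmkfstools can verify the disk chain for you:

vmkfstools -e DC02.vmdk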

New iSCSI server – but where’s my old VMFS volume? It’s missing!

My existing iSCSI setup wasn’t delivering the I/O I expected, so I went about upgrading my eSATA array controller so that I could RAID 10 across the 8 drives in my external drive enclosure (rather than the 4 the previous controller would allow), and in addition built a new Windows 2008 physical server to drive the I/O (rather than running it off my old Windows XP instance). To do this I had to offload all my existing VMFS data to another location temporarily to allow me to recreate the RAID. This was done using a number of external USB HDDs attached to the old iSCSI target server, passing them through as iSCSI targets to VMware. The VMs were then Storage vMotioned between iSCSI datastores until the external enclosure was free!

I installed Starwind (my iSCSI target software of choice) on the new server and hooked up the USB HDDs. I then proceeded to reconfigure it to present these USB HDDs to VMware as iSCSI targets.

I rescanned the iSCSI adapter, but to my surprise couldn’t see the VMFS volume – only the LUN itself. Having worked with resignaturing in the past, I realised that the volume must still be there lurking in the background; it was merely being masked because ESXi believed it was a snapshot, having previously been presented to the host under a different iSCSI IQN.

So without further ado, I ssh’d over to the TSM and ran the following command to confirm my thoughts:-

esxcfg-volume -l

The output produced the following:

VMFS3 UUID/label: 49d22e2e-996a0dea-b555-001f2960aed8/USB_VMFS_01
Can mount: Yes
Can resignature: Yes
Extent name: naa.60a98000503349394f3450667a744245:1 range: 0 – 397023 (MB)

Good news for me – the old VMFS volume, complete with its label, was still visible.

So, to re-add this back into the Storage view so that I could Storage vMotion the VMs back to my new 8-disk RAID setup, I ran the following command:

esxcfg-volume -M USB_VMFS_01

(you can specify -m if you only wish to mount it non-persistently for this boot; -M mounts the volume persistently so it survives a reboot).

Tada! VMFS volumes all present and correct.
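As an aside, the other option here would have been to resignature the volume rather than force-mounting it – esxcfg-volume can do that too, although the datastore comes back with a new UUID and a snap- prefixed label, so any registered VMs would need re-registering afterwards:

esxcfg-volume -r USB_VMFS_01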

I’m now seeing a HUGE performance gain from using the 8 disks, and I’m going to try my hardest to push the limits of the 1Gb iSCSI connection before I consider adding a second NIC for Round Robin on both the VMware hosts and the iSCSI target server.