
Hi Nutanix-Community,

 

I am trying to create a 4-node Nutanix CE cluster out of our old hardware. These are the specs of the nodes:

2 x Node: 
  - CPU: Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
  - Memory: 1.0Ti
  - Storage: 6x 3.5 TiB
  - Network: 2x Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

2x Node:
  - CPU: Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz
  - Memory: 1.0Ti
  - Storage: 6x 3.5 TiB
  - Network: 2x Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28

I used the following ISO image to install the nodes:

  - ISO used for install: phoenix.x86_64-fnd_5.6.1_patch-aos_6.8.1_ga.iso

 

After installing the nodes, I ran the following command on one of the CVMs to create a new cluster:
  - cluster -s 172.19.1.121,172.19.1.122,172.19.1.123,172.19.1.124 create
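
For reference, the per-service progress can also be watched from a second SSH session on one of the CVMs while the create command runs; a minimal sketch using the usual CVM status commands (I am assuming they behave the same on CE):

  # In a second SSH session on any CVM while "cluster ... create" is running:
  cluster status      # per-node, per-service state of the forming cluster
  genesis status      # which services genesis has started on this CVM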

 

But shortly after, the cluster creation process got stuck and kept showing the following message in a refreshing state (I waited about 12 h but nothing changed):

Waiting on 172.19.1.121 (Up) to start:
DynamicRingChanger
Pithos
InsightsDB
Athena
Mercury
Mantle
Stargate
InsightsDataTransfer
GoErgon
Cerebro
Chronos
Curator
Prism
Hera
AlertManager
Arithmos
Catalog
Acropolis
Castor
Uhura
NutanixGuestTools
MinervaCVM
ClusterConfig
APLOSEngine
APLOS
PlacementSolver
Lazan
Polaris
Delphi
Security
Flow
Anduril
Narsil
XTrim
ClusterHealth

Waiting on 172.19.1.122 (Up, ZeusLeader) to start:
DynamicRingChanger
Pithos
InsightsDB
Athena
Mercury
Mantle
Stargate
InsightsDataTransfer
GoErgon
Cerebro
Chronos
Curator
Prism
Hera
AlertManager
Arithmos
Catalog
Acropolis
Castor
Uhura
NutanixGuestTools
MinervaCVM
ClusterConfig
APLOSEngine
APLOS
PlacementSolver
Lazan
Polaris
Delphi
Security
Flow
Anduril
Narsil
XTrim
ClusterHealth

Waiting on 172.19.1.123 (Up) to start:
DynamicRingChanger
Pithos
InsightsDB
Athena
Mercury
Mantle
Stargate
InsightsDataTransfer
GoErgon
Cerebro
Chronos
Curator
Prism
Hera
AlertManager
Arithmos
Catalog
Acropolis
Castor
Uhura
NutanixGuestTools
MinervaCVM
ClusterConfig
APLOSEngine
APLOS
PlacementSolver
Lazan
Polaris
Delphi
Security
Flow
Anduril
Narsil
XTrim
ClusterHealth

Waiting on 172.19.1.124 (Up) to start:
DynamicRingChanger
Pithos
InsightsDB
Athena
Mercury
Mantle
Stargate
InsightsDataTransfer
GoErgon
Cerebro
Chronos
Curator
Prism
Hera
AlertManager
Arithmos
Catalog
Acropolis
Castor
Uhura
NutanixGuestTools
MinervaCVM
ClusterConfig
APLOSEngine
APLOS
PlacementSolver
Lazan
Polaris
Delphi
Security
Flow
Anduril
Narsil
XTrim
ClusterHealth


Do you have any hints on where I could look to find out why the services do not start as expected during the cluster creation process?

Thank you very much in advance!

Hey,

That’s great lab hardware, very jealous ;)

What do you see in the genesis.out on the nodes during this period?

I suspect something is stuck trying to start and everything else is hanging off the back of it. Does it work if you Ctrl+C and run it again?

Can all CVMs see each other happily etc? 

Are you using multiple NICs plugged in, or just one at the moment?
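
If it helps, something along these lines should answer both questions (log path assumed from the standard CVM layout):

  # From each CVM: check that it can reach the other CVMs.
  for ip in 172.19.1.121 172.19.1.122 172.19.1.123 172.19.1.124; do ping -c 2 "$ip"; done

  # And follow genesis on the node the create command is waiting on:
  tail -F /home/nutanix/data/logs/genesis.out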

Thanks,

Kim


Update: My first reply still needs to be checked by a moderator, while this second reply seems to be shown directly, so I am adding the missing info from the first reply here as well since the first one is not visible yet.

 

Thank you for your quick reply, Kcmount!

 

I already tried to stop the process with Ctrl+C and re-run the cluster creation. Unfortunately the process then says that the nodes cannot be added because they are already part of a cluster. So I stopped the cluster and destroyed it to be able to recreate it with the command.
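
For reference, this is roughly the sequence I used to get back to a clean state (the standard cluster commands from the CVM shell, so please treat the exact invocation as an assumption for CE):

  # Run on any CVM of the half-created cluster.
  # WARNING: "cluster destroy" removes the cluster configuration and any data on it.
  cluster stop
  cluster destroy
  # afterwards the create command can be retried:
  cluster -s 172.19.1.121,172.19.1.122,172.19.1.123,172.19.1.124 create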

Right now all nodes have two NICs, and one interface on each node is connected to the switch. All ports are in the same VLAN and the nodes are able to see each other.


I had only checked genesis.log, not genesis.out. I have now checked genesis.out on all 4 nodes: on nodes 3-4 there seem to be issues with Cassandra, while on nodes 1-2 it looks OK from my side.

Node 3:

2025-05-22 09:17:59,212Z INFO 19420864 service_mgmt_utils.py:146 current_disabled_services {'SnmpService'}
2025-05-22 09:17:59,212Z INFO 19420864 service_mgmt_utils.py:147 current_enabled_services set()
2025-05-22 09:17:59,813Z INFO 19420864 zookeeper_service.py:598 Zookeeper is running as follower
2025-05-22 09:17:59,813Z INFO 19420864 genesis_utils.py:8761 Executing cmd: /usr/local/nutanix/secure/bin/ntnx_privileged_cmd genesis get_pids --all
2025-05-22 09:17:59,905Z WARNING 27989056 cassandra_service.py:854 Unable to get the Cassandra token for 172.19.1.123, ret 3, stdout , stderr Error: "Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: localhost; nested exception is:
java.net.ConnectException: Connection refused (Connection refused)]" connecting to JMX agent, host: localhost port: 8080


2025-05-22 09:18:00,674Z INFO 19420864 zookeeper_service.py:598 Zookeeper is running as follower
2025-05-22 09:18:00,675Z INFO 19420864 epsilon_service.py:395 Checking epsilon status
2025-05-22 09:18:00,708Z INFO 19420864 service_mgmt_utils.py:146 current_disabled_services {'SnmpService'}
2025-05-22 09:18:00,709Z INFO 19420864 service_mgmt_utils.py:147 current_enabled_services set()
2025-05-22 09:18:01,106Z WARNING 27989056 cassandra_service.py:854 Unable to get the Cassandra token for 172.19.1.123, ret 3, stdout , stderr Error: "Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: localhost; nested exception is:
java.net.ConnectException: Connection refused (Connection refused)]" connecting to JMX agent, host: localhost port: 8080

Node 4:

2025-05-22 09:18:00,710Z INFO 29804576 service_mgmt_utils.py:146 current_disabled_services {'SnmpService'}
2025-05-22 09:18:00,710Z INFO 29804576 service_mgmt_utils.py:147 current_enabled_services set()
2025-05-22 09:18:00,714Z INFO 29804576 genesis_utils.py:8761 Executing cmd: /usr/local/nutanix/secure/bin/ntnx_privileged_cmd genesis get_pids --all
2025-05-22 09:18:00,845Z WARNING 00528512 cassandra_service.py:854 Unable to get the Cassandra token for 172.19.1.124, ret 3, stdout , stderr Error: "Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: localhost; nested exception is:
java.net.ConnectException: Connection refused (Connection refused)]" connecting to JMX agent, host: localhost port: 8080

2025-05-22 09:18:00,960Z INFO 29804576 epsilon_service.py:395 Checking epsilon status
2025-05-22 09:18:02,006Z WARNING 00528512 cassandra_service.py:854 Unable to get the Cassandra token for 172.19.1.124, ret 3, stdout , stderr Error: "Failed to retrieve RMIServer stub: javax.naming.ServiceUnavailableException [Root exception is java.rmi.ConnectException: Connection refused to host: localhost; nested exception is:
java.net.ConnectException: Connection refused (Connection refused)]" connecting to JMX agent, host: localhost port: 8080
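
To dig further into that JMX "Connection refused" on nodes 3 and 4, I plan to check whether the Cassandra process is actually up and listening; a rough sketch with plain Linux tools (nothing Nutanix-specific, log path assumed from the standard CVM layout):

  # On the affected CVMs (172.19.1.123 / 172.19.1.124):
  ps -ef | grep -i [c]assandra                   # is a Cassandra JVM running at all?
  ss -ltn | grep 8080                            # is anything listening on the JMX port from the log?
  tail -F /home/nutanix/data/logs/genesis.out    # follow genesis while it keeps retrying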

 


Can you tell us a bit more about the disks in the nodes (NVMe, SSD, HDD, etc.)? And are you using RAID?

Also, remove 2 disks per node, as 4 disks per node is the limit for CE (not counting the AHV boot disk).


Hi JeroenTielen,

 

the hardware comes from an old Nutanix cluster which was taken out of service.

 

I configured the disks in the installer setup as shown below. I was not fully sure whether, if I mark multiple disks as CVM disks, I can still use them for data. But this cluster is only meant as a testing cluster to get some experience with Nutanix CE before we create a real Nutanix CE cluster.

So if I did some mistakes in the setup please let me know :)

Node 1:
Disk 1 - SSD 223.6 GiB - Hypervisor
Disk 2 - SSD 223.6 GiB - CVM
Disk 3 - SSD 3.5 TiB - Data
Disk 4 - SSD 3.5 TiB - Data
Disk 5 - SSD 3.5 TiB - Data
Disk 6 - SSD 3.5 TiB - Data
Disk 7 - SSD 3.5 TiB - Data
Disk 8 - SSD 3.5 TiB - Data

Node 2:
Disk 1 - SSD 223.6 GiB - Hypervisor
Disk 2 - SSD 223.6 GiB - CVM
Disk 3 - SSD 3.5 TiB - Data
Disk 4 - SSD 3.5 TiB - Data
Disk 5 - SSD 3.5 TiB - Data
Disk 6 - SSD 3.5 TiB - Data
Disk 7 - SSD 3.5 TiB - Data
Disk 8 - SSD 3.5 TiB - Data

Node 3:
Disk 1 - Marvell_NVMe_Controller__1 476.9 GiB - Hypervisor
Disk 2 - SSD 3.5 TiB - CVM
Disk 3 - SSD 3.5 TiB - CVM
Disk 4 - SSD 3.5 TiB - Data
Disk 5 - SSD 3.5 TiB - Data
Disk 6 - SSD 3.5 TiB - Data
Disk 7 - SSD 3.5 TiB - Data

Node 4:
Disk 1 - Marvell_NVMe_Controller__1 476.9 GiB - Hypervisor
Disk 2 - SSD 3.5 TiB - CVM
Disk 3 - SSD 3.5 TiB - CVM
Disk 4 - SSD 3.5 TiB - Data
Disk 5 - SSD 3.5 TiB - Data
Disk 6 - SSD 3.5 TiB - Data
Disk 7 - SSD 3.5 TiB - Data


Update: Attached stack trace file

I see a stack trace looping in the cassandra.out log. Unfortunately all posts containing the raw log have been held for review by the moderator team, so I am pasting the stack trace as an image to speed this up and hope it can be posted without review.
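
In case it helps anyone reproduce the check, this is roughly how the looping trace can be pulled out of the log on the affected CVM (log path assumed from the standard CVM layout):

  # Show the tail of the Cassandra stdout log; the same exception block repeats:
  tail -n 200 /home/nutanix/data/logs/cassandra.out
  # Or print recent occurrences of the exception with a bit of context:
  grep -n -A 5 "Exception" /home/nutanix/data/logs/cassandra.out | tail -n 60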

 

 


I understand that the nodes are original Nutanix nodes, but the limit for CE is 4 disks per node.
 

I would really suggest removing all disks and only leaving 4x 3.5 TiB SSDs in each node for CVM/data, and then selecting 2x for CVM and 2x for data.


OK, I will try it and reinstall the nodes with 1 hypervisor disk and 4 CVM/data disks (2x CVM & 2x data), and will check whether this solves the issue I have with nodes 3 & 4.

However, I am surprised that the Nutanix installation dialog itself preselects more than 4 disks for CVM/data when Nutanix specifies a maximum of 4 CVM/data disks, and that you are able to select more than 4 disks at all. Is there any reason for that?


That limit is only for CE. With the normal production/commercial version you can use more. But yes, the installer will show more, and it would probably work for you as well; still, it is better to get it working within the requirements first, and later you can try to add the disks.
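
If you later want to verify which disks the CVM has actually claimed before adding the extra ones back, something like this should do it on any CVM (exact helper name and mount path from memory, so double-check):

  list_disks                          # physical disks the CVM has claimed
  df -h | grep stargate-storage       # data disks mounted for Stargate (assumed standard mount path)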

 

One more thing: are the first two nodes those older G6 nodes? If so, enable the built-in RAID controller (in the BIOS), create a RAID 1 from the two 240 GB disks, and use that volume for AHV.


Yes, I will create a RAID 1 out of the two 240 GB disks and will set up everything as suggested.

 

Thank you for your help!


I enabled "Intel VMD Technology" (screenshot 1) and changed "Configure sSATA as" from "AHCI" to "RAID" (screenshot 2). I then created a RAID 1 volume out of the two 240 GB disks with "Intel Rapid Storage Technology" during startup. But unfortunately, after these changes I was not able to select the RAID volume: the installer still only showed the individual 240 GB disks in the setup dialog, and furthermore the data/CVM disks disappeared (screenshot 3).

 

I was only able to get the disks back in the install dialog when I changed "Configure sSATA as" back from "RAID" to "AHCI". But then I am not able to create a RAID via "Intel Rapid Storage Technology" during startup.

 

Did I miss anything, or does Nutanix CE not support software RAID volumes?

 

 

 


Good news: the cluster creation command ran without problems this time, and the cluster was created successfully.

Maybe someone also knows, or has a hint, why Nutanix didn't show the RAID volume in the setup dialog, or why the other disks are no longer shown once "AHCI" is changed to "RAID".

Thank you very much so far for your help and have a nice weekend!

