
Snr Software Engr, Infra Arch & Svc Mgt
- Singapore
- Permanent
- Full-time
- Implement and enable applications for Splunk APM (Application Performance Monitoring), RUM (Real User Monitoring), and Synthetic Monitoring.
- Onboard new applications and services to the Splunk monitoring platforms (APM, RUM, and Synthetic Monitoring).
- Maintain Splunk Observability tools, including Infrastructure Monitoring.
- Continuously tune and optimize alert rules to reduce false positives/negatives and improve incident detection.
- Coordinate with the Operations team for patching and upgrading of Splunk agents across environments.
- Proactively monitor hosts, applications, and infrastructure components to detect and resolve issues early.
- Create and fine-tune dashboards to provide actionable insights and visibility into system and application health.
- Configure and manage paging and alerting duties to ensure timely responses to incidents.
- Collaborate with cross-functional teams to improve monitoring coverage and effectiveness.
- Organize and lead bi-monthly operational meetings related to Splunk and monitoring activities.
- Manage business-as-usual (BAU) tasks alongside continuous enhancements to the monitoring solutions.
- Develop, maintain, and update documentation, including Standard Operating Procedures (SOPs) and knowledge base articles.
- 2+ years of hands-on experience with Splunk Observability Suite, including APM, RUM, Synthetic Monitoring, and Infrastructure Monitoring.
- Proven experience implementing and operationalizing AIOps platforms and developing monitoring-related use cases.
- Strong experience managing and supporting Splunk Enterprise, ITSI, or ELK Stack in production across medium to large-scale environments.
- Expertise in setting up and integrating Splunk or ELK Stack (Elasticsearch, Logstash, Kibana) to aggregate and process data from diverse sources.
- Proficiency in data visualization techniques and tools such as Kibana and Splunk Dashboards to generate actionable insights.
- Experience optimizing and tuning alert rules, thresholds, and detectors to reduce false positives/negatives in observability tools.
- Experience configuring incident response mechanisms, including alert routing, paging tools, and on-call setup.
- Solid understanding of integration with ITSM platforms (e.g., ServiceNow) and automated remediation workflows.
- Familiarity with integrating observability tools with automation/configuration management tools such as Ansible, Terraform, or Chef.
- Working knowledge of SQL with the ability to optimize queries for reporting and analytics.
- Understanding of Security and Access Management in observability and monitoring platforms.
- Comfortable working across Linux/Unix/Windows environments and managing agents and services on these platforms.
- Experience collaborating with cross-functional teams and driving monitoring improvements and operational efficiency.
- Excellent documentation skills with experience creating and maintaining Standard Operating Procedures (SOPs) and knowledge base articles.
- Strong troubleshooting and analytical skills for identifying performance bottlenecks in infrastructure and applications.