Midterm Exam
Data Science ~ ITSB ~ Class A
1 Instructions
You are provided with a dataset containing numerical and categorical variables. Using this dataset, complete the following tasks to demonstrate your understanding of data visualization, central tendency analysis, and measures of dispersion.
1.1 Data Visualization
- Create at least five types of visualizations (e.g., bar chart, histogram, and line chart) to explore the data.
- Label each graph clearly with appropriate titles, axis labels, and legends.
- Briefly interpret each visualization (e.g., describe trends, patterns, or potential outliers).
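As a rough illustration of what a bar chart and a histogram actually display, the sketch below tallies category counts and bin counts using only the Python standard library. The variable names and all values are hypothetical and invented for this example; the bin width of 10 is likewise an arbitrary assumption.

```python
from collections import Counter

# Hypothetical sample values, invented for illustration only.
departments = ["Sales", "IT", "Sales", "HR", "IT", "Sales"]
ages = [23, 27, 31, 35, 38, 42, 45, 51, 58, 64]

# A bar chart shows one bar per category, with height equal to its count.
category_counts = Counter(departments)


def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin [lower, lower + bin_width)."""
    counts = Counter()
    for v in values:
        lower = start + ((v - start) // bin_width) * bin_width
        counts[lower] += 1
    return dict(sorted(counts.items()))


# A histogram shows one bar per numeric interval (bin).
age_bins = histogram_counts(ages, bin_width=10, start=20)

# Text rendering of the histogram: one '#' per observation.
for lower, n in age_bins.items():
    print(f"[{lower}, {lower + 10}): {'#' * n}")
```

When interpreting the result, the tallest bin indicates where values concentrate, and isolated bins far from the rest hint at potential outliers.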
1.2 Central Tendency Analysis
- Calculate the mean, median, and mode for at least two numerical variables.
- Compare and interpret the results: discuss what each measure indicates about the data distribution.
- Identify any skewness in the data based on the relationship between the mean, median, and mode (use a histogram).
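The three measures and the mean-versus-median comparison can be sketched as follows, using Python's standard `statistics` module. The `salaries` values and the helper `skew_direction` are hypothetical, introduced only for illustration. The comparison is a rule of thumb that can fail for some distributions, which is why the histogram check is still needed.

```python
import statistics

# Hypothetical values, invented for illustration only.
salaries = [4, 5, 5, 5, 6, 6, 7, 8, 12, 22]

mean = statistics.mean(salaries)      # arithmetic average
median = statistics.median(salaries)  # middle value of the sorted data
mode = statistics.mode(salaries)      # most frequent value


def skew_direction(mean, median):
    """Rule-of-thumb skew direction from the mean/median relationship."""
    if mean > median:
        return "right-skewed (positive)"
    if mean < median:
        return "left-skewed (negative)"
    return "roughly symmetric"


print(f"mean={mean}, median={median}, mode={mode}")
print(skew_direction(mean, median))
```

Here the two large values (12 and 22) pull the mean above the median and the mode, which is the typical signature of a long right tail.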
1.3 Measures of Dispersion
- Compute the range, variance, standard deviation, and interquartile range (IQR) for the same numerical variables used above.
- Interpret these values: explain what they reveal about the spread or variability in the dataset (use a box plot, histogram, and scatter plot).
- Identify which variable shows greater variability and discuss possible reasons.
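A minimal sketch of these dispersion measures, again with Python's standard `statistics` module. The data and the helper `dispersion_summary` are hypothetical. Two assumptions are worth stating: the variance and standard deviation here are the sample versions (dividing by n - 1), and quartiles are computed with the `"inclusive"` method; other software may use a different quartile convention, so IQR values can differ slightly. The coefficient of variation (standard deviation divided by the mean) is not required by the task; it is included as one common way to compare variability between variables measured in different units.

```python
import statistics


def dispersion_summary(values):
    """Return range, sample variance, sample std dev, IQR, and CV."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    std_dev = statistics.stdev(values)  # sample standard deviation (n - 1)
    return {
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std_dev": std_dev,
        "iqr": q3 - q1,
        "cv": std_dev / statistics.mean(values),  # unit-free spread
    }


# Hypothetical values, invented for illustration only.
study_hours = [2, 4, 4, 5, 7, 9, 11]
scores = [55, 60, 62, 70, 71, 85, 94]

hours_summary = dispersion_summary(study_hours)
scores_summary = dispersion_summary(scores)

for name, summary in [("study_hours", hours_summary), ("scores", scores_summary)]:
    print(name, {k: round(v, 3) for k, v in summary.items()})
```

In this made-up example, `scores` has the larger standard deviation in absolute terms, while `study_hours` has the larger coefficient of variation. Which variable "varies more" therefore depends on whether spread is judged in raw units or relative to the mean, and the interpretation should say which one is meant.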
1.4 Summary and Interpretation
Write a short summary (150–200 words) explaining the overall findings from your analysis.
Your summary should address:
- Which variables are most consistent (low dispersion)
- Which variables show the greatest variation
- What patterns or insights you discovered through visualization
2 Group Assignments
To enhance collaborative learning and the application of statistical concepts, students are divided into eight groups. Each group will work on an assigned case study and publish its analysis using R Markdown and RPubs. The project aims to develop analytical, technical, and communication skills through both written and video presentations.
Work in Groups:
- Each group listed below must collaborate to complete a data analysis project based on materials from Chapters 1–5 of the course book.
Develop an R Markdown Report:
- Use RStudio to create a report in .Rmd format.
- Include data import, visualization, and interpretation.
- Publish the final report on RPubs.
Add a YouTube Explanation Video:
- Each group must create a 5–10 minute video presentation in which the members actively explain the content of their RPubs report.
- Upload the video to YouTube.
- Include the YouTube video link at the end of your RPubs page under the section “Video Explanation”.
| Group | Student ID | Name | Dataset |
|---|---|---|---|
| Group 1 | 52250001 | ANGELIQUE KIYOSHI LAKEISHA BAHRUL ULUM | Employee Retention Dataset |
|  | 52250002 | PUTRI ADRIA GARINI |  |
|  | 52250003 | NAILATUL WAFIROH |  |
|  | 52250004 | NAYCHILA ADELIA ZAHRAH |  |
|  | 52250005 | FRENKHY TONGA RETANG |  |
| Group 2 | 52250006 | NADIA APRIANI | Sales Dataset |
|  | 52250007 | YOSEF TEOFANI TAMBA |  |
|  | 52250008 | ARYA FHAREZI |  |
|  | 52250009 | DHEA PUTRI KHASANAH |  |
|  | 52250010 | WULAN GUSTIKA ANTASYA TUMANGGOR |  |
| Group 3 | 52250011 | CHRISTIAN MICHAEL JULIANO | Urban Business Dataset |
|  | 52250012 | HIROSE KAWARIN SIRAIT |  |
|  | 52250013 | CECILIA MUTIARA HANDAYANI |  |
|  | 52250014 | DHEFIO ALIM MUZAKKI |  |
|  | 52250015 | M. YUSTIAN PUTRA MUHADI |  |
| Group 4 | 52250018 | KHAFIZATUN NISA | Insurance Dataset |
|  | 52250019 | RAFAEL YOGI SEPTIADI PUTRA |  |
|  | 52250020 | RONI KURNIAWAN |  |
|  | 52250021 | VERÓNICA MARIA LUCIA FERREIRA XAVIER |  |
|  | 52250023 | NAKEISHA AULIA ZAHRA |  |
| Group 5 | 52250024 | JIHAN RAMDHANI DEANDRI | Investment Dataset |
|  | 52250025 | ANINDYA KRISTIANINGPUTRI |  |
|  | 52250027 | HAURA AZIZAH ACHMAD |  |
|  | 52250030 | RISKY NURHIDAYAH |  |
|  | 52250031 | M. FITRAH AIDIL HARAHAP |  |
| Group 6 | 52250032 | HANAFI MALIK RIFA’I | Agriculture Dataset |
|  | 52250033 | SAFINA ZAHRA |  |
|  | 52250036 | AHMAD RIZKI MUBARAK |  |
|  | 52250037 | NURUL IFFAH |  |
|  | 52250038 | FIFI MUTHIA PITALOKA |  |
| Group 7 | 52250039 | CLARA MAISIE WANGHILI | Education Dataset |
|  | 52250040 | NAISYA HAFIZH MUFIDAH |  |
|  | 52250041 | CHELSEA TESALONIKA PATRICIA HUTAJULU |  |
|  | 52250042 | ULIN NIKMAH |  |
|  | 52250043 | PASKALIS FARELNATA ZAMASI |  |
| Group 8 | 52250045 | NAZWA NUR RAMADHANI | Hospital Dataset |
|  | 52250048 | VANESSA ZIBA ARDELIA |  |
|  | 52250049 | ZIDHAN ALFAREZI AFDI |  |
|  | 52250050 | DEN YUAN FRASSEKA |  |
3 Datasets
3.1 Sales Dataset
In the modern business era, strategic decisions are no longer made based solely on intuition but must be supported by comprehensive data analysis. Sales data, marketing expenditures, customer satisfaction levels, and organizational characteristics such as store size and managerial experience can all provide valuable insights into business performance.
3.2 Urban Business Dataset
In rapidly growing urban economies, businesses operate within highly dynamic environments influenced by population density, consumer preferences, technological adaptation, and competitive markets. Understanding the factors that drive monthly revenue — such as marketing expenditure, product pricing, workforce size, managerial experience, and customer satisfaction — has become increasingly vital for strategic decision-making.
Urban business performance also varies by city and industry sector, with unique patterns emerging between retail, technology, manufacturing, and food & beverage sectors. To uncover these patterns, it is necessary to apply data visualization, measures of central tendency, and measures of dispersion to explore how revenue fluctuates across business types, cities, and sales channels.
By conducting this descriptive and visual analysis, organizations can identify not only performance gaps but also opportunities for optimization in marketing strategy, pricing, and human resource management.
3.3 Hospital Dataset
In the modern healthcare system, hospitals generate massive amounts of data every day—from patient admissions, treatment records, and medication usage to doctor performance and cost management. These data hold valuable insights that can help improve patient outcomes, optimize operational efficiency, and support evidence-based decision-making.
However, healthcare data are often complex, involving both categorical variables (such as department, patient type, or region) and numerical variables (such as patient age, treatment cost, and recovery time). To make sense of this complexity, it is crucial to apply descriptive statistical analysis and data visualization methods.
Through measures such as central tendency (mean, median, mode) and dispersion (range, variance, standard deviation), analysts can understand variations in patient outcomes, resource allocation, and treatment performance.
3.4 Insurance Dataset
In the insurance industry, data-driven decision-making plays a vital role in managing risk, pricing policies, and predicting customer claims. Each insured individual represents a unique combination of demographic, behavioral, and health-related characteristics that collectively determine their risk profile and potential claim cost.
However, analyzing such data is challenging due to the interaction between categorical variables (e.g., region, insurance plan, smoking status, employment type) and numerical variables (e.g., age, BMI, income, and health score). Understanding these relationships requires the use of descriptive statistics, data visualization, and statistical modeling techniques. By applying measures of central tendency and measures of dispersion, analysts can explore the variability of claims across different customer segments.
3.5 Investment Dataset
Investment decisions in the modern financial landscape are increasingly complex and influenced by multiple economic and personal factors. Investors differ not only in terms of age, income level, and financial goals, but also in their risk tolerance and investment strategies. These variations contribute to diverse investment outcomes in terms of return, volatility, and asset growth.
To better understand investor behavior and performance, a comprehensive analytical approach is required. The dataset includes categorical variables such as investor segment, region, and investment type, along with numerical variables including investment amount, risk score, portfolio diversification, and annual return percentage.
Through the use of descriptive statistics, such as measures of central tendency (mean, median) and measures of dispersion (standard deviation, variance), analysts can quantify and visualize variations in investor performance. Meanwhile, data visualization techniques provide clear insights into patterns and distributions within the dataset.
3.6 Employee Retention Dataset
Employee retention is a key concern for organizations seeking to maintain a skilled and motivated workforce. Various factors, such as job satisfaction, workload, managerial experience, training opportunities, and compensation, influence how long employees stay with a company. Understanding these patterns helps organizations design better policies to improve retention and reduce turnover.
This dataset captures categorical variables like department, employment type, and work location, as well as numerical variables including monthly salary, job satisfaction score, work hours, training hours, performance score, and retention period. By applying descriptive statistics, analysts can summarize the data using measures of central tendency (mean, median, mode) to understand typical employee characteristics, and measures of dispersion (range, variance, standard deviation, interquartile range) to evaluate variability among employees. Data visualization techniques such as histograms, boxplots, and bar charts provide intuitive insights into distributions, patterns, and outliers, enabling organizations to make informed decisions to improve employee engagement and retention strategies.
3.7 Agriculture Dataset
Agricultural productivity is influenced by a variety of environmental, managerial, and resource-related factors. Farmers’ decisions regarding crop type, irrigation method, fertilizer usage, and seed quantity, combined with regional characteristics such as climate and soil conditions, determine crop yield outcomes. Understanding these factors helps agricultural planners, agronomists, and policymakers improve farm efficiency and resource allocation. This dataset includes categorical variables such as region, crop type, irrigation method, and fertilizer type, along with numerical variables including farm size, seed amount, fertilizer amount, rainfall, temperature, and farmer experience. The target variable is crop yield, measured in tons per hectare.
By applying descriptive statistics, analysts can summarize typical farm characteristics using measures of central tendency (mean, median, mode) and assess variability using measures of dispersion (range, variance, standard deviation, interquartile range). Data visualization techniques—such as histograms, boxplots, and bar charts—enable clear exploration of patterns, distributions, and outliers, providing actionable insights for improving agricultural practices and maximizing crop yield.
3.8 Education Dataset
Student academic performance is influenced by a wide range of factors that encompass personal, familial, and institutional characteristics. Variables such as study hours, teacher experience, class size, parental support, and student motivation play crucial roles in shaping learning outcomes. Additionally, factors like attendance rate, access to learning resources, participation in extracurricular activities, and prior academic achievements further contribute to variations in performance across students.
This dataset includes both categorical variables, such as SchoolType (Public, Private, Charter), GradeLevel (Elementary, Middle, High, University), Region (North, South, East, West), and TeachingMethod (Traditional, Blended, Online), as well as numerical variables like:
- StudyHours (weekly study hours)
- TeacherExperience (years of experience)
- ClassSize (number of students per class)
- ParentalSupport (scale 1–10)
- StudentMotivation (score 0–100)
- AttendanceRate (%)
- ResourceAccessScore (scale 0–100, availability of learning resources)
- ExtracurricularHours (hours/week spent in activities)
- PriorGPA (previous grade point average)
The target variable, AcademicPerformance, reflects the overall student achievement on a 0–100 scale.
By applying descriptive statistics, analysts can summarize the dataset using measures of central tendency (mean, median, mode) to understand typical student characteristics and measures of dispersion (range, variance, standard deviation, interquartile range) to evaluate variability among students.
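As one possible way to combine a categorical variable with a numerical one, the sketch below summarizes AcademicPerformance within each SchoolType. Only the two column names come from the dataset description above; the rows, the values, and the helper `summarize_by` are hypothetical and exist purely for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical rows: only the column names come from the dataset description.
rows = [
    {"SchoolType": "Public", "AcademicPerformance": 62},
    {"SchoolType": "Public", "AcademicPerformance": 70},
    {"SchoolType": "Public", "AcademicPerformance": 78},
    {"SchoolType": "Private", "AcademicPerformance": 74},
    {"SchoolType": "Private", "AcademicPerformance": 80},
    {"SchoolType": "Private", "AcademicPerformance": 86},
    {"SchoolType": "Charter", "AcademicPerformance": 55},
    {"SchoolType": "Charter", "AcademicPerformance": 75},
    {"SchoolType": "Charter", "AcademicPerformance": 95},
]


def summarize_by(rows, group_key, value_key):
    """Group rows by a categorical column and summarize a numeric column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        name: {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "std_dev": statistics.stdev(values),  # sample std dev (n - 1)
        }
        for name, values in groups.items()
    }


summary = summarize_by(rows, "SchoolType", "AcademicPerformance")
for school_type, stats in summary.items():
    print(school_type, stats)
```

A group with a similar mean but a much larger standard deviation (the made-up Charter rows here) would be described as less consistent, even though its typical performance is comparable.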
Data visualization techniques such as histograms, boxplots, scatterplots, and bar charts provide clear insights into distributions, patterns, correlations, and outliers. These visualizations help educators and policymakers identify trends, assess the impact of various factors, and make informed decisions to improve teaching strategies, resource allocation, and student support programs.