Yousef Hosny Elsayed (S2141806) - Group Leader
Hanani Nurshafira Binti Hamdan (S2150141)
Jasmeen Bong Kah Ying (S2142739)
Loh Cin Ceat (S2141070)
2022-06-19
Initial Question
What are the characteristics of students who achieve high/low grades?
Given a new student, how can we predict their final grade based on specific characteristics?
Our product is a Shiny app powered by a regression model that predicts student performance in secondary school. The regression model is trained on 70% of the dataset listed below and tested on the remaining 30%.
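The 70/30 split described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the data frame is synthetic, the column names (G2, failures, traveltime, G3) are illustrative stand-ins for the previous-grade, failure, travel-time and final-grade fields, and lm() is used only as a generic regression model because the report does not say which model was trained.

```r
# Synthetic stand-in data (assumption: the real dataset is not loaded here).
set.seed(42)
n <- 200
students <- data.frame(
  G2         = sample(0:20, n, replace = TRUE),  # previous grade (illustrative)
  failures   = sample(0:3,  n, replace = TRUE),  # past failures (illustrative)
  traveltime = sample(1:4,  n, replace = TRUE)   # travel time category (illustrative)
)
# Final grade generated from the features plus noise, clipped to the 0-20 range.
students$G3 <- pmin(20, pmax(0, round(0.9 * students$G2 - students$failures + rnorm(n))))

# 70/30 split: draw 70% of the row indices for training, keep the rest for testing.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train_set <- students[train_idx, ]
test_set  <- students[-train_idx, ]

# Fit a linear regression on the training rows only (stand-in model).
model <- lm(G3 ~ G2 + failures + traveltime, data = train_set)

# Predict final grades for the held-out rows.
test_pred <- predict(model, newdata = test_set)
```

Fixing the seed makes the split reproducible, so the same rows land in the training and testing sets on every run.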
Data source
https://archive.ics.uci.edu/ml/datasets/Student+Performance
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
Feature Selection according to heatmap
Performance of Training Set
MSE: 2.505773 MAE: 0.9221254 RMSE: 1.582963 R2: 0.8320791
Performance of Testing Set
MSE: 2.211841 MAE: 0.9043607 RMSE: 1.487226 R2: 0.8320791
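For reference, the four metrics reported above can be computed from actual and predicted grades as in the sketch below. The helper name regression_metrics and the small example vectors are hypothetical and chosen for illustration; they are not taken from the project code or the dataset.

```r
# Hypothetical helper: standard regression metrics from actual and predicted values.
regression_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  mse <- mean(residuals^2)
  c(
    MSE  = mse,
    MAE  = mean(abs(residuals)),
    RMSE = sqrt(mse),
    # R2 = 1 - (residual sum of squares / total sum of squares)
    R2   = 1 - sum(residuals^2) / sum((actual - mean(actual))^2)
  )
}

# Illustrative values only, not taken from the dataset.
actual    <- c(10, 12, 15, 8)
predicted <- c(11, 12, 14, 9)
metrics   <- regression_metrics(actual, predicted)
print(metrics)
```

Calling the same function once on the training predictions and once on the testing predictions would give one set of metrics per split.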
Scatter Plot of Testing Set and Feature Importance Plot
Link to Shiny App https://jasmeen.shinyapps.io/StudentPerformancePrediction/
Link to Github https://github.com/yousefhosny1/student-performance-shiny-app
Our Experience
The experience was amazing. As university students, we found the project relevant because it helped us understand the factors that affect our grades in this and upcoming semesters. According to the feature importance produced by our model, the factors that contribute most to grades are previous grades, previous failures, and travel time. Since our group members had good grades last semester and no failures, and the semester is taught online so travel time is minimal, we expect to score well this semester. The project also strengthened our R skills. Our teamwork and communication skills improved significantly as well, because we had to stay composed when dealing with bugs and errors and had to work together and communicate efficiently to resolve them.
Data security is one of the most pressing issues in data science, affecting businesses all over the world. A few examples of data security breaches include data system attacks, ransomware, and theft. Theft of information is the most common data security concern, particularly for organizations that have access to sensitive data such as financial or customer personal information. The threat to data travelling over the network has increased exponentially as the amount of information exchanged over the Internet has increased.
Another issue facing data science today is multiple data sources. To gather and manage information about their clients, sales, or workers, businesses have begun using a variety of software and mobile applications, such as CRMs and ERPs. It can be difficult to consolidate data from various sources of unstructured or semi-structured information. Because each tool collects information in its own way, the resulting formats are not uniform. This also means that there are numerous sources from which to process and retrieve data. Data scientists frequently struggle to comprehend and derive useful insights from heterogeneous sources. As a result, they end up spending more time filtering the data, which leads to mistakes and inaccurate judgements.
As more organizations become dependent on data science, the demand for skilled data professionals is increasing. Many employees have been unable to keep up with developments because the traditional way of working with data has changed over the years. Not only in data science, but also in the general technology sector, there are major skill gaps and talent shortages. Organizations often struggle to find people with the right level of knowledge and expertise to build a machine learning team. Companies also struggle to find employees with the right business perspective on data science. This is just as important as knowledge of the subject, as machine learning projects are only successful if the machine learning team can solve key business problems and use the data to tell the right story.
One of the most pressing challenges of Big Data is storing these huge data sets properly. The quantity of data being stored in the data centers and databases of companies is increasing rapidly. As these data sets grow exponentially with time, they become challenging to handle. Most of the data is unstructured and comes from documents, videos, audio, text files, and other sources. This means that it cannot be found in a database. Companies often get confused when selecting the best tool for Big Data analysis and storage. Is HBase or Cassandra the best technology for data storage? Is Hadoop MapReduce good enough, or will Spark be a better option for data analytics and storage? These questions trouble companies, and sometimes they are unable to find the answers. They end up making poor decisions and selecting inappropriate technology. As a result, money, time, effort, and work hours are wasted.