Already Completed Code

As we move on from week to week and task to task, the code that you have already completed will stay on the template but will not run. This is made possible by adding eval=FALSE to the corresponding code chunk. Note that the libraries still need to be loaded in this program as well.
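For example, in the .Rmd source a chunk from an earlier week can be kept on the template but skipped during knitting with a header like the following (a minimal sketch; the chunk label old-task is just a placeholder):

```{r old-task, eval=FALSE}
# code from an earlier week stays visible here but is not executed when knitting
```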

# Install and load necessary libraries
#install.packages("ggplot2") # Install ggplot2 for plotting, if you have already installed the packages, comment this out by enterring a # in front of this command
#install.packages("scales")  # Install scales for formatting
#install.packages("moments") # Install moments for skewness and kurtosis
library(ggplot2)            # Load ggplot2 library
library(scales)             # Load scales library
library(moments)

Setting up your directory on your computer.

This needs to be addressed here.

# Check the current working directory
getwd()
## [1] "C:/Users/abdou/OneDrive - Prod Student NU/Documents/DDS-8500-Principle of Data Science/Module 4"
# in the next line, change the directory to the place where you saved the
# data file; if you prefer, you can save your data.csv file in the directory
# that the getwd() command above indicated.
# for example, your next line should look something similar to this: setwd("C:/Users/tsapara/Documents")

# Set the working directory to where the data file is located
# This ensures the program can access the file correctly

# setwd("C:/Users/ITsapara/Downloads")

### Choose an already existing directory on your computer.

Setting up your personalized data

# Read the CSV file
# The header parameter ensures column names are correctly read
# sep defines the delimiter (comma in this case)
# stringsAsFactors prevents automatic conversion of strings to factors
df <- read.csv("data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

##########################################################
# Define variables A and B based on your student ID
# A represents the first 3 digits, B represents the last 3 digits
A <- 34
B <- 99
Randomizer <- A + B # Randomizer ensures a consistent seed value for reproducibility


# Generate a random sample of 500 rows from the dataset
set.seed(Randomizer) # Set the seed for reproducibility
sample_size <- 500
df <- df[sample(nrow(df), sample_size, replace = TRUE), ] # Sample the dataset

write.csv(df, file = "my_data.csv", row.names = FALSE) # this command may take some time to run; once it is done, it will create the desired data file locally in your directory

Knit your file

As practice, you may now want to knit your file to an HTML. To do this, you should click on the knit button on the top panel and wait for the file to render. The HTML will open once it is done for you to review.

It is recommended to practice with RMD and download and review the following cheatsheets: https://rmarkdown.rstudio.com/lesson-15.HTML

In addition, you may want to alter some of the editor components and re-knit your file to gain some knowledge and understanding of RMD. For a complete tutorial, visit: https://rmarkdown.rstudio.com/lesson-2.html

df <- read.csv("my_data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)

Cleaning up your data

Step 0. Now that you have read the file, you want to learn some basic information about your data

The following commands will not be explained here; do your research, review your csv file, and answer the questions related to this part of your code.

# Basic exploratory commands
nrow(df)       # Number of rows in the dataset
## [1] 500
length(df)     # Number of columns (or variables) in the dataset
## [1] 16
str(df)        # Structure of the dataset (data types and a preview)
## 'data.frame':    500 obs. of  16 variables:
##  $ ID           : int  57 1056 1086 971 1063 17 1059 1072 988 951 ...
##  $ Gender       : Factor w/ 3 levels "Female","Male",..: 1 1 1 1 2 1 1 3 1 1 ...
##  $ Age          : int  27 27 27 27 30 27 26 28 29 NA ...
##  $ Height       : int  158 168 168 168 182 158 160 185 155 160 ...
##  $ Weight       : int  NA 65 65 65 80 NA 55 85 54 55 ...
##  $ Education    : Factor w/ 5 levels "","Bachelor's",..: 2 2 2 2 4 2 2 3 4 3 ...
##  $ Income       : int  42000 45000 45000 45000 65000 42000 48000 NA 55000 39000 ...
##  $ MaritalStatus: Factor w/ 3 levels "","Married","Single": 3 3 3 3 2 3 3 3 2 2 ...
##  $ Employment   : Factor w/ 3 levels "","Employed",..: 3 2 2 2 2 3 2 3 2 2 ...
##  $ Score        : num  NA 6.2 6.2 6.2 NA NA 6.1 5.7 7.3 6.1 ...
##  $ Rating       : Factor w/ 4 levels "","A","B","C": 2 3 3 3 4 2 3 2 3 2 ...
##  $ Category     : Factor w/ 3 levels "Art","Music",..: 1 1 1 1 2 1 1 3 2 1 ...
##  $ Color        : Factor w/ 4 levels "","Blue","Green",..: 3 2 2 2 4 3 2 3 2 2 ...
##  $ Hobby        : Factor w/ 5 levels "","Photography",..: 4 3 3 3 5 4 4 2 4 4 ...
##  $ Happiness    : num  6.5 7 7 7 8.5 6.5 7 6 8.2 7 ...
##  $ Location     : Factor w/ 4 levels "","City","Rural",..: 2 2 2 2 2 2 3 3 2 2 ...
summary(df)    # Summary statistics for each column
##        ID            Gender         Age            Height          Weight     
##  Min.   :   1.0   Female:206   Min.   :25.00   Min.   :155.0   Min.   :54.00  
##  1st Qu.:  50.0   Male  :185   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:62.00  
##  Median : 991.0   Other :109   Median :28.00   Median :175.0   Median :70.00  
##  Mean   : 726.7                Mean   :28.44   Mean   :172.6   Mean   :70.37  
##  3rd Qu.:1056.0                3rd Qu.:30.00   3rd Qu.:182.0   3rd Qu.:80.00  
##  Max.   :1117.0                Max.   :34.00   Max.   :190.0   Max.   :90.00  
##                                NA's   :32      NA's   :24      NA's   :54     
##        Education       Income      MaritalStatus      Employment 
##             : 15   Min.   :32000          : 20             : 13  
##  Bachelor's :186   1st Qu.:45000   Married:185   Employed  :357  
##  High School:114   Median :48000   Single :295   Unemployed:130  
##  Master's   :104   Mean   :51130                                 
##  PhD        : 77   3rd Qu.:60000                                 
##  NA's       :  4   Max.   :70000                                 
##                    NA's   :86                                    
##      Score       Rating    Category     Color             Hobby    
##  Min.   :5.500    :  6   Art   :187        :  1              :  1  
##  1st Qu.:6.100   A:177   Music :135   Blue :207   Photography: 96  
##  Median :6.200   B:194   Sports:178   Green:156   Reading    :148  
##  Mean   :6.652   C:123                Red  :136   Swimming   :108  
##  3rd Qu.:7.500                                    Traveling  :147  
##  Max.   :8.900                                                     
##  NA's   :73                                                        
##    Happiness       Location  
##  Min.   :6.000         :  1  
##  1st Qu.:7.000   City  :223  
##  Median :7.000   Rural :185  
##  Mean   :7.434   Suburb: 91  
##  3rd Qu.:8.500               
##  Max.   :9.000               
##  NA's   :26

Your Turn

Please answer the following questions, by typing information after the question.

Question 1

What type of variables does your file include?

Answer 1:

The file includes both quantitative and qualitative variables. Quantitative variables measure numerical attributes such as Age, Height, Weight, Income, and Happiness, while qualitative variables describe categorical characteristics such as Gender, Education, MaritalStatus, Employment, Rating, Category, Color, Hobby, and Location.

Question 2

Specific data types?

Answer 2:

The dataset includes the following data types (see the sketch below):
- Integer: ID, Age, Height, Weight, Income
- Numeric (continuous): Score, Happiness
- Factor (categorical): Gender, Education, MaritalStatus, Employment, Rating, Category, Color, Hobby, Location
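A quick way to verify these types directly (a small sketch using base R):

sapply(df, class)   # prints the storage class of every column in the dataset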

Question 3

Are they read properly?

Answer 3:

The variables are read properly and the data types are appropriately assigned. Numeric and integer variables are correctly stored as quantitative fields, while categorical variables are interpreted as factors with defined levels. The dataset structure indicated no unintended type coercion during import.

Question 4

Are there any issues?

Answer 4:

Several variables have missing values, represented by both NAs and blanks. For example, Age, Height, Weight, Income, Score, and Happiness contain NAs, while variables such as Education, MaritalStatus, Rating, Color, Hobby, and Location have empty factor levels that represent missing values.

Question 5

Does your file include both NAs and blanks?

Answer 5:

Yes, the summary statistics show both forms of missingness. Numeric variables show NA counts (Age has 32 NAs, Weight has 54) while several factor variables include blank levels (Education has an empty level with 15 observations). This indicates inconsistent encoding of missing values.

Question 6

How many NAs do you have?

Answer 6:

The number of NAs for each variable is as follows:
- Age: 32
- Height: 24
- Weight: 54
- Education: 4
- Income: 86
- Score: 73
- Happiness: 26

For a total of:

# Count the total number of NAs in the dataset
total_nas <- sum(is.na(df))
total_nas # Print the total number of missing values
## [1] 299

Question 7

How many blanks?

Answer 7:

The blanks (empty strings) in the categorical variables represent missing information that was not initially coded as NA. Based on the summary statistics, the count of blanks by variable is as follows (a programmatic check is sketched after the list):
- Education: 15
- MaritalStatus: 20
- Employment: 13
- Rating: 6
- Color: 1
- Hobby: 1
- Location: 1
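One hedged way to verify these counts programmatically, assuming the blanks are stored as the empty string before Step 1 recodes them (a sketch):

# Count empty-string entries in every factor column (run before the blanks are converted to NA)
sapply(df[sapply(df, is.factor)], function(col) sum(col == "", na.rm = TRUE))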

Cleanup Continued

Step 1: Handling both blanks and NAs is not simple so first we want to eliminate some of those, let’s eliminate the blanks and change them to NAs

#
# Step 1:  # Handling both blanks and NAs is not simple so first we want to eliminate
# some of those, let's eliminate the blanks and change them to NAs
#


# Replace blanks with NAs across the dataset
# This ensures that blank values are consistently treated as missing data
df[df == ""] <- NA

# Convert specific columns to factors
# This step ensures categorical variables are treated correctly after replacing blanks
factor_columns <- c("Gender", "Education", "Rating", "MaritalStatus", "Category", 
                    "Employment", "Color", "Hobby", "Location")
df[factor_columns] <- lapply(df[factor_columns], function(col) as.factor(as.character(col)))

Cleanup Continued

Step 2: Count NAs in the entire dataset

#
# Step 2: Count NAs in the entire dataset


# Count the total number of NAs in the dataset
total_nas <- sum(is.na(df))
total_nas # Print the total number of missing values
## [1] 356

Your Turn

Please answer the following questions, by typing information after the question.

Question 8

Explain what the printed number is, what information it relays, and how you can use it in your analysis.

Answer 8:

The number calculated above is the total number of missing values across all cells in the dataset after the blanks were replaced with NA.

This information relays:
- the magnitude of missingness in the data, providing a data quality signal.
- a baseline metric for tracking progress as you clean the data (e.g., through imputation).

The number can be used in my analysis (see the sketch below) to:
- calculate the overall percentage of missing data,
- identify where missingness is concentrated,
- find rows with missing data,
- diagnose the missingness mechanism (MCAR/MAR/MNAR), and
- track cleaning progress.
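A minimal sketch of how the single total can be broken down into the metrics listed above, using only base R:

mean(is.na(df)) * 100                  # overall percentage of missing cells
colSums(is.na(df))                     # where missingness is concentrated, by column
head(which(rowSums(is.na(df)) > 0))    # first few rows that contain missing data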

Clean Up Continued

Step 3: Count rows with NAs.

#
# Step 3: Count rows with NAs
#

# Count rows with at least one NA
rows_with_nas <- sum(rowSums(is.na(df)) > 0)
Percent_row_NA <- percent(rows_with_nas / nrow(df)) # Percentage of rows with NAs
rows_with_nas
## [1] 286
Percent_row_NA
## [1] "57%"

Your Turn

Question 9

How large is the proportion of the rows with NAs? We can drop up to 5%.

Answer 9:

The file contains 286 rows with at least 1 missing value, representing 57% of the sample. This proportion exceeds the commonly accepted threshold of 5% for deletion; therefore, dropping these rows is not feasible.
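For illustration only, a sketch of what complete-case deletion would leave behind; it is not a recommendation here:

sum(complete.cases(df))               # number of rows with no missing values (214 of the 500 here)
sum(complete.cases(df)) / nrow(df)    # fraction of the sample that would remain (about 43%)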

Question 10

Do you think it would be wise to drop the above percentage?

Answer 10:

No, dropping 57% of the observations will result in a significant loss of information, drastically reduce the statistical power, weaken representativeness, and increase the risk of biased results.

Question 11

How will this affect your dataset?

Answer 11:

Dropping 57% of the dataset would compromise inferential validity and model stability. It would reduce the ability to detect real relationships, distort the population estimates, and undermine model generalization. Such a large deletion would also violate best practices in data science and statistical analysis.

CleanUp Continued

Step 4: Count columns with NAs

#  
# Step 4: Count columns with NAs

# Count columns with at least one NA
cols_with_nas <- sum(colSums(is.na(df)) > 0)
Percent_col_NA <- percent(cols_with_nas / length(df)) # Percentage of columns with NAs
cols_with_nas
## [1] 13
Percent_col_NA
## [1] "81%"

Your Turn

Question 12

How large is the proportion of the columns with NAs? We never want to drop entire columns, as this would mean that we would lose variables and associations, but do you think it would be wise to drop the above percentage?

Answer 12:

Approximately 81% of the columns have at least one missing value. Dropping them would eliminate most variables in the dataset, resulting in severe information loss, the removal of potentially important predictors, and a damaging impact on the dataset. This technique would not be wise.

Question 13

How will this affect your dataset?

Answer 13:

Dropping columns with missing values would reduce the dataset’s dimensionality, remove key variables, and destroy meaningful associations between predictors. By retaining all variables, we keep all potential predictors, which is crucial for establishing relationships between variables.
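A short sketch of the per-column missingness, which shows which variables drive the 81% figure and supports imputing rather than dropping:

round(colMeans(is.na(df)) * 100, 1)   # percentage of NAs in each column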

Imputation

Step 5: Replace NAs with appropriate values (mean for numeric and integer, mode for factor, "NA" for character)

In later weeks we will learn how to replace the NAs properly based on the descriptive statistics, and you will discuss this code. For now, you can assume that using the mean of the variable for numeric columns and the mode for categorical columns is correct - this is not always the case, of course, but the code would become much more complicated otherwise.

# 
# Step 5: Replace NAs with appropriate values (mean for numeric and integer,
# mode for factor, "NA" for character)
# In later weeks we will learn how to replace the NAs properly based on the
# descriptive statistics and you will discuss this code.
# for now, you can assume that by setting the mean of the variable for numeric
# and mode for categorical it is correct - this is not always the case of course
# but the code will become much more complicated in that case.


# Replace NAs with appropriate values
# Numeric: Replace with the mean if sufficient data is available
# Categorical: Replace with the mode (most common value)
# Character: Replace with the string "NA"
df <- lapply(df, function(col) {
  if (is.numeric(col) || is.integer(col)) { # Numeric or integer columns
    if (sum(!is.na(col)) > 10) {
      col[is.na(col)] <- mean(col, na.rm = TRUE) # Replace with mean
    } else {
      col[is.na(col)] <- approx(seq_along(col), col, n = length(col))[["y"]][is.na(col)] # Interpolation
    }
  } else if (is.factor(col)) { # Factor columns
    mode_val <- names(sort(-table(col)))[1] # Mode (most common value)
    col[is.na(col)] <- mode_val
  } else if (is.character(col)) { # Character columns
    col[is.na(col)] <- "NA" # Replace with "NA"
  }
  return(col) # Return the modified column
})

df <- as.data.frame(df) # Convert the list back to a dataframe


#
# following the above method to impute, has now changed some of the statistics


# Check the updated dataset and ensure no remaining NAs
summary(df)
##        ID            Gender         Age            Height          Weight     
##  Min.   :   1.0   Female:206   Min.   :25.00   Min.   :155.0   Min.   :54.00  
##  1st Qu.:  50.0   Male  :185   1st Qu.:27.00   1st Qu.:165.0   1st Qu.:65.00  
##  Median : 991.0   Other :109   Median :28.00   Median :172.6   Median :70.00  
##  Mean   : 726.7                Mean   :28.44   Mean   :172.6   Mean   :70.37  
##  3rd Qu.:1056.0                3rd Qu.:30.00   3rd Qu.:182.0   3rd Qu.:80.00  
##  Max.   :1117.0                Max.   :34.00   Max.   :190.0   Max.   :90.00  
##        Education       Income      MaritalStatus      Employment 
##  Bachelor's :205   Min.   :32000   Married:185   Employed  :370  
##  High School:114   1st Qu.:45000   Single :315   Unemployed:130  
##  Master's   :104   Median :51130                                 
##  PhD        : 77   Mean   :51130                                 
##                    3rd Qu.:60000                                 
##                    Max.   :70000                                 
##      Score       Rating    Category     Color             Hobby    
##  Min.   :5.500   A:177   Art   :187   Blue :208   Photography: 96  
##  1st Qu.:6.100   B:200   Music :135   Green:156   Reading    :149  
##  Median :6.500   C:123   Sports:178   Red  :136   Swimming   :108  
##  Mean   :6.652                                    Traveling  :147  
##  3rd Qu.:7.300                                                     
##  Max.   :8.900                                                     
##    Happiness       Location  
##  Min.   :6.000   City  :224  
##  1st Qu.:7.000   Rural :185  
##  Median :7.000   Suburb: 91  
##  Mean   :7.434               
##  3rd Qu.:8.200               
##  Max.   :9.000
# Count columns with at least one NA
cols_with_nas_2 <- sum(colSums(is.na(df)) > 0)
Percent_col_NA_2 <- percent(cols_with_nas_2 / length(df)) # Percentage of columns with NAs
cols_with_nas_2
## [1] 0
Percent_col_NA_2
## [1] "0%"

Your Turn

Essay Question

Run summary(df) and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. Are there any more NA's in your file? What is the information that is printed by the summary? How can this be interpreted? What are your observations? Verify the effects of imputation, and explain in detail. Compare the updated summary with the earlier statistics and note changes. Explain everything that you observe.

Answer

After running summary(df) on the imputed dataset and comparing it with the earlier summary, several changes stand out. Some of them are predictable, while others raise more important statistical concerns.
The most obvious change is that missing values have been replaced.
Before imputation, many variables showed NAs directly in the summary or indirectly in reduced counts. After applying the imputation rules, every missing value in the numeric, integer, factor, and character columns has been replaced. This is confirmed both by summary(df), which no longer shows any missing values, and by colSums(is.na(df)), where the count of columns containing at least one NA decreased to zero. From a technical perspective, the dataset is now complete and compatible with methods that cannot handle missing values.

Even if the completeness is helpful, it comes at a cost. The latest summary(df) provides a compact view of how imputation reshapes the data structure. For numeric and integer variables, the reported statistics (minimum, 1st quartile, median, mean, 3rd quartile, maximum) reflect the combined observed and imputed values. Before imputation, these measures reflected only the real observations. After imputation, each missing value is replaced with the mean. The mean may shift slightly, but the more significant impact is the reduced variability. Quartiles often move closer together, and the median tends to drift toward the mean. Adding many identical values compresses the distribution and makes the data look more stable and regular than they truly were. When a variable has substantial missingness, this artificial tightening can be visible.
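A toy illustration of this shrinking effect, using made-up values rather than the actual dataset:

x <- c(2, 4, 6, 8, NA, NA, NA)                           # hypothetical variable with three NAs
sd(x, na.rm = TRUE)                                      # spread of the observed values only
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # mean imputation
sd(x_imputed)                                            # smaller, because three identical means were added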

Variables imputed through interpolation show a different distortion. Interpolation can introduce smooth, almost linear patterns that may not reflect the real data-generating process. While the minimum and maximum are often plausible, the quartiles and median can suggest an orderly structure that never existed in the original dataset. In these cases, the summary statistics no longer represent observed behavior but rather a constructed approximation.

Factor variables present a different type of change. The summary of factors lists the counts of the most frequent levels. Before imputation, the distribution of categories reflected only the observed data, potentially underrepresenting some levels due to missingness. After imputation, all missing factor values have been replaced with the mode, the most common category. As a result, the dominant level becomes even more dominant. In the updated summary, you may notice that one category's count has increased substantially compared to the earlier statistics, while the relative frequencies of other levels have decreased. This inflates the majority class and can bias any downstream analysis or model toward that category, especially in classification tasks. It also diminishes the influence of minority categories, potentially masking rare but important patterns.

Character variables are handled differently: missing values are replaced with the literal string NA. In the summary, character columns usually display their length, class, and mode rather than detailed frequency distributions. However, conceptually, the dataset now treats NA as an actual observed value rather than a missing entry. If these columns are later converted to factors, NA will appear as a genuine category. This can be useful if you intend to model “missingness” as a meaningful state, but it can also be misleading if you forget that this label does not correspond to a real observed category in the original data. The summary itself may not fully expose this nuance, but it is an important observation when interpreting the transformed dataset.

When comparing the updated summary with the earlier one, several overarching patterns emerge. First, there are no more NAs, which is operationally convenient but conceptually important: the uncertainty that was previously explicit is now hidden inside imputed values. Second, numeric variables often show reduced spread and more central clustering, a direct consequence of mean imputation and interpolation. Third, factor variables exhibit a more skewed distribution toward the mode, reflecting the decision to fill all missing categories with the most frequent level. Fourth, character variables now implicitly encode missingness as a string, which may or may not align with the substantive meaning of the data.

These changes illustrate both the power and the risk of imputation. On the positive side, you have preserved all columns—no variables were dropped—so you retain the full set of potential predictors and relationships. This aligns with the goal of not losing variables and associations, especially when 81% of columns had at least one missing value. On the other hand, the imputation strategy has altered the statistical properties of the dataset: variability has been reduced, majority categories have been reinforced, and some patterns may now reflect the imputation mechanism more than the original data‑generating process.

In summary, the post‑imputation summary(df) confirms that there are no remaining NAs and that all variables are still present, but it also reveals subtle and sometimes undesired changes in distributions, variability, and category frequencies. These observations highlight the importance of interpreting imputed data with care: while the dataset is now complete and usable for many algorithms, the underlying uncertainty and structure have been reshaped by the imputation choices you made.

Descriptive Statistics

Step 6: Create descriptive statistics for all variables

We run all the descriptive statistics for all the numeric variables

################################################################### 
# 
# Step 6: Create descriptive statistics for all variables
# We run all the descriptive statistics for all the numeric variables
#
###################################################################
# Initialize a function to compute descriptive statistics
compute_stats <- function(column, name) {
  if (is.numeric(column) || is.integer(column)) {
    data.frame(
      Variable = name,
      Mean = round(mean(column, na.rm = TRUE), 2),
      Median = round(median(column, na.rm = TRUE), 2),
      St.Deviation = round(sd(column, na.rm = TRUE), 2),
      Range = round(diff(range(column, na.rm = TRUE)), 2),
      IQR = round(IQR(column, na.rm = TRUE), 2),
      Skewness = round(skewness(column, na.rm = TRUE), 2),
      Kurtosis = round(kurtosis(column, na.rm = TRUE), 2),
      stringsAsFactors = FALSE
    )
  } else {
    NULL
  }
}

# Apply the function to each numeric or integer column in the dataset
descriptive_stats <- do.call(
  rbind,
  lapply(names(df), function(col) compute_stats(df[[col]], col))
)

# Print the descriptive statistics dataframe
descriptive_stats
##    Variable     Mean   Median St.Deviation   Range     IQR Skewness Kurtosis
## 1        ID   726.73   991.00       461.17  1116.0  1006.0    -0.83     1.72
## 2       Age    28.44    28.00         1.97     9.0     3.0     0.69     3.39
## 3    Height   172.56   172.56         9.39    35.0    17.0    -0.03     1.82
## 4    Weight    70.37    70.00        10.19    36.0    15.0     0.06     1.98
## 5    Income 51130.43 51130.43      8447.05 38000.0 15000.0     0.19     2.70
## 6     Score     6.65     6.50         0.85     3.4     1.2     0.74     2.76
## 7 Happiness     7.43     7.00         0.97     3.0     1.2     0.13     1.91

Descriptive Statistics Continued

Step 7: Print Descriptive Statistics

Now you have all the descriptive statistics for all numeric variables. Create a professional table in your paper. The kableExtra library can help you create the table here. If you have no programming experience, you can cut and paste into Excel and beautify the table there.

#############################################################
# 
# Step 7: Print Descriptive Statistics
# Now you have all the descriptive statistics for all numeric variables
# Create a professional table in your paper.
# the library(KableExtra), can help you create the table here.
# if you have no programming experience you can cut and paste in Excel
# and beautify the table in Excel
#############################################################
  
  print("Descriptive Statistics:")
## [1] "Descriptive Statistics:"
  print(descriptive_stats)
##    Variable     Mean   Median St.Deviation   Range     IQR Skewness Kurtosis
## 1        ID   726.73   991.00       461.17  1116.0  1006.0    -0.83     1.72
## 2       Age    28.44    28.00         1.97     9.0     3.0     0.69     3.39
## 3    Height   172.56   172.56         9.39    35.0    17.0    -0.03     1.82
## 4    Weight    70.37    70.00        10.19    36.0    15.0     0.06     1.98
## 5    Income 51130.43 51130.43      8447.05 38000.0 15000.0     0.19     2.70
## 6     Score     6.65     6.50         0.85     3.4     1.2     0.74     2.76
## 7 Happiness     7.43     7.00         0.97     3.0     1.2     0.13     1.91
  library(kableExtra)
  kable(descriptive_stats, caption = "Descriptive Statistics", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header
Descriptive Statistics

Variable     Mean      Median    St.Deviation  Range    IQR      Skewness  Kurtosis
ID           726.73    991.00    461.17        1116.0   1006.0   -0.83     1.72
Age          28.44     28.00     1.97          9.0      3.0      0.69      3.39
Height       172.56    172.56    9.39          35.0     17.0     -0.03     1.82
Weight       70.37     70.00     10.19         36.0     15.0     0.06      1.98
Income       51130.43  51130.43  8447.05       38000.0  15000.0  0.19      2.70
Score        6.65      6.50      0.85          3.4      1.2      0.74      2.76
Happiness    7.43      7.00      0.97          3.0      1.2      0.13      1.91

Your Turn

Essay Question

Review and compare with the previous statistics. Do you observe any undesired changes? Explain in detail. How can this be interpreted? What are your observations? Verify the descriptive statistics, and explain in detail. Explain everything that you observe. Complete your research, compare your variables, and complete your paper

Answer

A comparison of the descriptive statistics before and after data cleaning and imputation reveals several important observations. Overall, the imputation process introduces no severe or undesired distortions, although some expected and interpretable changes are evident.

After imputation, all numeric variables report complete summary measures, confirming that missing values have been successfully addressed. Measures of central tendency, particularly the mean and median, remain closely aligned for most variables, indicating that the imputation process did not substantially bias the data. For example, variables such as Age, Height, Weight, Score, and Happiness show minimal divergence between mean and median, suggesting approximately symmetric distributions and stable central values.

A slight reduction in variability is observed in some variables, which is a known side effect of mean-based imputation. This is most visible in Income, where the standard deviation remains relatively large but marginally smoother than expected in raw data. This indicates that while imputation preserved the overall scale and spread of income values, it likely reduced extreme variability by replacing missing observations with average values. This effect is acceptable at this stage of exploratory analysis, but should be acknowledged when interpreting results.

Skewness and kurtosis values provide further insight into distributional shape. Age and Score display moderate positive skewness, indicating a slight concentration of observations at lower values. Income shows near-symmetry with mild kurtosis, suggesting a distribution that is neither excessively peaked nor heavy-tailed. Happiness demonstrates near-normal behavior with low skewness and kurtosis, reinforcing its suitability for standard parametric analysis.

One undesired but expected issue is the treatment of ID as a numeric variable. While the descriptive statistics correctly compute its mean, range, and dispersion, these values are not analytically meaningful. ID functions purely as an identifier, not a quantitative measure, and its inclusion in numeric summaries inflates variability metrics without adding interpretive value. This variable should be excluded from analytical modeling and inferential conclusions.
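If ID should be excluded, one hedged option is simply to drop its row from the statistics table built in Step 6 (a sketch):

descriptive_stats_no_id <- subset(descriptive_stats, Variable != "ID")  # remove the identifier row
descriptive_stats_no_id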

The descriptive statistics verify internal consistency across variables. Ranges and interquartile ranges align logically with the nature of each variable. For instance, Age and Happiness have narrow ranges reflecting constrained measurement scales, while Income exhibits a much wider range consistent with real-world income variation. Standard deviations increase proportionally with scale, confirming that the data behave as expected.

In summary, the imputation process successfully removed missing values without introducing meaningful bias into the dataset. Minor smoothing of variability is observed, which is an anticipated trade-off of simple imputation techniques. The dataset is now complete, statistically stable, and suitable for further analysis. Future research stages should refine imputation strategies and exclude non-analytic identifiers such as ID to ensure optimal model validity and interpretability.

Visual Representations

Step 8: Create graphs using ggplot2

For this step there are parts that you will need to change to create your graphs. The example is set to work with Income. Make the necessary changes to create the rest of the graphs. You may also want to change the colors, the dimensions, etc…

  #######################################################################
  # 
  # Step 8: Create graphs using ggplot2
  # For this part there are parts that you will need to change to create 
  # your graphs.
  # The example is set to work with Income
  # Make the necessary changes to create the rest of the graphs
  # You may also want to change the colors, the dimensions etc...
  #############################################################
  
  #############################################################
  #
  # STEP 8a: Create a bargraph or a histogram
  # Explain what graph was that and why?
  # Set col to the desired column name
  #############################################################
  #
  ##
  # In this code we start you off with an example of Happiness; later in the code
  # you should replace this with your desired variable.
  #
  
    col = "Happiness"  # This is an example, try to do the same with a different variable

Bar graph - if the variable of your choice is categorical

  # Assume df is your dataframe and col is the column name (as string)
if (is.factor(df[[col]])) {
  # Bar graph for factors
  ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
    geom_bar() +
    labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
    theme_minimal() +
    theme(legend.position = "right")
  
} 

Histogram - if the variable of your choice is numerical

You can also copy the chunk and create more graphs by resetting the col variable appropriately

if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 30, fill = "steelblue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

col = "Age"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 30, fill = "steelblue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

col = "Score"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 30, fill = "steelblue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

col = "Income"
if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
  # Histogram for numeric variables
  ggplot(df, aes(x = .data[[col]])) +
    geom_histogram(bins = 30, fill = "steelblue", color = "black") +
    labs(title = paste("Histogram for", col), x = col, y = "Frequency") +
    theme_minimal()
}

Your Turn

Essay Question

Now that you can observe graphically your data, explain the importance of graphical representations and how this helps to communicate data with other parties. Explain what graph was that and why?

Answer

Graphical representations play a critical role in data analysis because they translate numerical results into patterns that are easy to see, interpret, and communicate. While summary statistics describe data using single values, graphs reveal the underlying distribution, concentration, spread, and unusual behavior that numbers alone can hide. This makes visualizations especially effective when sharing findings with instructors, stakeholders, or non-technical audiences.

The graph shown is a histogram of the variable Score. A histogram is the appropriate choice because Score is a numeric variable measured on a continuous scale. This type of graph groups values into bins and displays how frequently observations fall within each range. Unlike bar charts, which are designed for categorical data, histograms allow us to assess distributional shape and variability.

From the histogram, it is clear that most score values cluster between approximately 6 and 7.5, indicating a strong central concentration. This visually confirms the descriptive statistics, where the mean (6.65) and median (6.5) are close, suggesting a fairly symmetric distribution. The histogram also shows fewer observations at the lower and higher ends, with a small number of higher scores approaching 9. This supports the slight positive skewness observed numerically and suggests the presence of mild upper-end variability, but no extreme outliers.

Graphical representations like this histogram improve communication by making abstract statistics intuitive. They help identify patterns such as clustering, skewness, gaps, and potential anomalies at a glance. In collaborative or academic settings, visuals support clearer discussion, reduce misinterpretation, and strengthen conclusions by visually validating statistical results. In this case, the histogram reinforces that the Score variable is well-behaved, moderately dispersed, and suitable for further analysis.

Your Turn

STEP 8b: Create a boxplot and a histogram for numeric variables. Note that the bin width cannot be set up in the same way for Age or Happiness, which have a small range, and Income, whose range is in the thousands. Change this appropriately.

Please note that this part of the code will not run for the demo code. You will need to change the value of eval=FALSE to eval=TRUE, after you introduce your code, to run it and add it to your knitted file.

 #############################################################
      #
      # STEP 8b: Create a boxplot  and Histogram for numeric variables
      # note that the bin width cannot be set up in the same way to work with
      # Age or Happiness, which have a small range, and Income, whose range is in the thousands
      # Change this appropriately
      #############################################################
      #
      # Choose a numeric variable (e.g., Age), set the col variable to the name of the column, then rerun the code that is commented out here.

#col = ____ Add the variable of your choice  

# Uncomment the code and you will create a Bar graph or a Histogram of a different variable here.
# Do not forget to change the value of eval=FALSE to eval=TRUE to run and knit this chunk
    
  if (is.factor(df[[col]])) { # if the col is categorical, then the code will
      # create the Bar graph
      # Highlight and run until the line that starts with `# Boxplot for numeric variables`
      #
      # If the col is numeric, then it will create the histogram instead
      # Bar graph for factors
      ggplot(df, aes(x = .data[[col]], fill = .data[[col]])) +
        geom_bar() +
        labs(title = paste("Bar Graph for", col), x = col, y = "Count") +
        theme_minimal() +
        theme(legend.position = "right")
    } else if (is.numeric(df[[col]]) || is.integer(df[[col]])) {
      
      ggplot(df, aes(x = .data[[col]])) +
        geom_histogram(binwidth = 0.3) +
        labs(title = paste("Histogram for", col), x = col, y = "Count") +
        theme_minimal()
}
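One hedged way to address the bin-width issue mentioned above is to derive the bin width from the variable's range, so the same chunk works for Happiness (range of about 3) and Income (range in the tens of thousands). This is a sketch, not the only valid choice:

col <- "Income"                                   # assumption: any numeric column in df
bw <- diff(range(df[[col]], na.rm = TRUE)) / 30   # aim for roughly 30 bins
ggplot(df, aes(x = .data[[col]])) +
  geom_histogram(binwidth = bw, fill = "steelblue", color = "black") +
  labs(title = paste("Histogram for", col), x = col, y = "Count") +
  theme_minimal()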

Your Turn

Essay Question

Now explain this graph. Focus on the information extracted, anomalies, outliers, relationships.

Answer

The histogram for Age provides a clear visual summary of how observations are distributed across the sample. Most ages are concentrated between the mid-20s and early 30s, with the highest frequency occurring around 27 to 30 years old. This confirms the descriptive statistics, where the mean (28.44) and median (28) are very close, indicating a stable central tendency and a fairly balanced distribution.

The shape of the histogram suggests a slight right skew, which aligns with the positive skewness reported in the descriptive statistics. There are fewer observations at the higher end of the age range, particularly above 32, which pulls the distribution slightly to the right. However, this skewness is mild and does not indicate severe imbalance.

In terms of anomalies and outliers, there are no extreme or isolated values. The minimum and maximum ages fall within a reasonable and expected range, and all observations appear to follow a natural progression without abrupt gaps. The narrow range and relatively small spread reflect the low standard deviation and interquartile range, indicating that the sample is fairly homogeneous with respect to age.

No direct relationships with other variables can be inferred from this univariate graph alone, but the consistency and tight clustering suggest that age is unlikely to introduce high variability into models or analyses. Overall, the histogram confirms that Age is well-distributed, free of significant outliers, and suitable for further statistical analysis without the need for transformation or additional data cleaning.

Your Turn

Step 8c: NOTE that you should run this part with the latest value of col. Do not forget to change eval=FALSE to eval=TRUE to knit it.

Boxplot for numeric variables

       #############################################################
       #
       # Step 8c
       # NOTE that you should run this part of the code after you 
       #  copy the graph that the previous code creates. Boxplot for numeric variables
       #############################################################
      # The next 5 lines will run only if the col is numeric, otherwise will give you an error.
      
  
       ggplot(df, aes(x = "", y = .data[[col]])) +
  geom_boxplot(fill = "skyblue", color = "darkblue", width = 0.3, outlier.color = "red", outlier.size = 2) +
  labs(
    title = paste("Box Plot for", col),
    x = NULL,
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

       #############################################################
       #
       # Step 8c
       # NOTE that you should run this part of the code after you 
       #  copy the graph that the previous code creates. Boxplot for numeric variables
       #############################################################
      # The next 5 lines will run only if the col is numeric, otherwise will give you an error.
      
  col =  "Happiness"
       ggplot(df, aes(x = "", y = .data[[col]])) +
  geom_boxplot(fill = "skyblue", color = "darkblue", width = 0.3, outlier.color = "red", outlier.size = 2) +
  labs(
    title = paste("Box Plot for", col),
    x = NULL,
    y = "Value"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.y = element_text(size = 14),
    axis.text.y = element_text(size = 12)
  )

Your Turn

Essay Question

Explain the findings of your Boxplot. Are there any outliers? What is the IQR? Focus on the information extracted, anomalies, outliers, relationships.

Answer

The boxplot for Happiness summarizes the distribution by highlighting central tendency, spread, and potential outliers. The median value is located near 7, indicating that at least half of the observations report happiness levels at or above this point. This aligns with the descriptive statistics, where the median is 7 and the mean is slightly higher at 7.43, suggesting a generally positive happiness level across the sample.

The interquartile range (IQR), which represents the middle 50% of the data, spans approximately from 7 to about 8.2. This indicates a relatively tight clustering of happiness scores, showing low variability among most respondents. The narrow IQR suggests that individual responses are fairly consistent and concentrated within a small range.

There are no visible outliers in the boxplot. All data points fall within the whiskers, meaning there are no extreme values beyond 1.5 times the IQR. This confirms that the distribution does not contain unusually low or high happiness scores that would require special treatment or raise concerns about data quality.

The whiskers extend from roughly 6 to 9, capturing the full observed range of the variable. The slightly longer upper whisker reflects mild right-side spread, which is consistent with the small positive skewness observed in the descriptive statistics. Overall, the boxplot shows a stable distribution with no anomalies, limited dispersion, and no extreme values. The Happiness variable appears well-behaved and suitable for further analysis without transformation.
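A short sketch to verify this reading of the boxplot for Happiness, using the standard 1.5 * IQR rule for flagging outliers:

q <- quantile(df$Happiness, c(0.25, 0.75), na.rm = TRUE)  # 1st and 3rd quartiles
iqr <- IQR(df$Happiness, na.rm = TRUE)                    # interquartile range
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
sum(df$Happiness < lower_fence | df$Happiness > upper_fence, na.rm = TRUE)  # count of flagged outliers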

Tabular Representations

Step 9: Tables

Creating tables to understand how the different categorical variables interconnect. Tabular information can be provided in both tables and parallel barplots. The following is an example on two variables; choose two others to get more valuable insights.

#############################################################
#
# Step 9
#
# Creating tables to understand how the different categorical variables
# interconnect
# Tabular information can be provided in both tables and parallel barplots.
# The following is an example on two variables, choose two others to get
# more valuable insights.
#############################################################

Gender_Education <- table(df$Education, df$Gender)
Gender_Education # what does this information tell you?
##              
##               Female Male Other
##   Bachelor's     185    9    11
##   High School     10   19    85
##   Master's        11   93     0
##   PhD              0   64    13
# How Many rows are there?
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Gender_Education) # Add totals to your table
##              
##               Female Male Other Sum
##   Bachelor's     185    9    11 205
##   High School     10   19    85 114
##   Master's        11   93     0 104
##   PhD              0   64    13  77
##   Sum            206  185   109 500
color <- c("red","blue","yellow","green")
names <- c("Bachelor's","High School", "Master's","PhD")
barplot(Gender_Education, col=color, beside= TRUE, main = "Education by Gender", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5) 

# topright is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Gender_Education))
##              
##               Female Male Other Sum
##   Bachelor's     185    9    11 205
##   High School     10   19    85 114
##   Master's        11   93     0 104
##   PhD              0   64    13  77
##   Sum            206  185   109 500

Making Pretty Tables

library(knitr)
library(kableExtra)

# Create the contingency table
Gender_Education <- table(df$Education, df$Gender)

# Add row and column totals
Gender_Education_margins <- addmargins(Gender_Education)

# Make a clean and beautiful table with kable
kable(Gender_Education_margins, caption = "Gender by Education Level", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header
Gender by Education Level

              Female  Male  Other  Sum
Bachelor’s    185     9     11     205
High School   10      19    85     114
Master’s      11      93    0      104
PhD           0       64    13     77
Sum           206     185   109    500
Location_Education <- table(df$Education, df$Location)
Location_Education # what does this information tell you?
##              
##               City Rural Suburb
##   Bachelor's   130    75      0
##   High School    4   110      0
##   Master's      89     0     15
##   PhD            1     0     76
# How Many rows are there? 4 rows
# This is the number of colors you should have in the vector below
# more intuitive colors can be added here.
# Keep the order from top to bottom to create your legend vector
addmargins(Location_Education) # Add totals to your table
##              
##               City Rural Suburb Sum
##   Bachelor's   130    75      0 205
##   High School    4   110      0 114
##   Master's      89     0     15 104
##   PhD            1     0     76  77
##   Sum          224   185     91 500
color <- c("red","blue","yellow","green")
names <- c("Bachelor's","High School", "Master's","PhD")
barplot(Location_Education, col=color, beside= TRUE, main = "Education by Location", ylim = c(0,250) )
legend("topright",names,fill=color,cex=0.5) 

# top right is the position of the legend, it can be moved to top, left bottom, etc...
# you do not change the rest of the parameters here
print(addmargins(Location_Education))
##              
##               City Rural Suburb Sum
##   Bachelor's   130    75      0 205
##   High School    4   110      0 114
##   Master's      89     0     15 104
##   PhD            1     0     76  77
##   Sum          224   185     91 500
# Make a clean and beautiful table with kable
kable(Location_Education, caption = "Table Education by Location", align = 'c') %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")  # Highlight header
Table Education by Location

              City  Rural  Suburb
Bachelor’s    130   75     0
High School   4     110    0
Master’s      89    0      15
PhD           1     0      76

Your Turn

Essay Question

Explain the table in details. Focus on the information extracted, anomalies, outliers, relationships.

Answer

The table presents a cross-tabulation of Gender by Education Level, allowing us to examine how educational attainment is distributed across gender categories. Each cell represents the count of individuals within a specific combination of gender and education, while the margins provide totals for both rows and columns.

Several clear patterns emerge from the table. At the Bachelor’s level, females represent the overwhelming majority, with 185 individuals compared to only 9 males and 11 categorized as other. This suggests a strong concentration of females holding bachelor’s degrees in the dataset. In contrast, the High School category is dominated by individuals classified as “Other,” with 85 observations, while females and males appear in much smaller numbers. This imbalance is notable and may reflect either the structure of the sampled population or differences in how gender categories are represented at lower education levels.

At the Master’s level, males are the dominant group, with 93 observations, while females account for only 11 and no individuals are recorded under the “Other” category. A similar pattern appears at the PhD level, where males again make up the majority with 64 observations, followed by 13 in the “Other” category and none classified as female. These results indicate a strong association between higher education levels and male representation in this dataset.

The totals at the bottom of the table show that the dataset is relatively balanced between females (206) and males (185), with a substantial number of individuals classified as “Other” (109). However, the distribution across education levels is far from uniform. The absence of females at the PhD level and the absence of “Other” individuals at the Master’s level stand out as anomalies. While these are not statistical outliers in the traditional sense, they represent structural gaps that could influence downstream analysis.

Overall, the table highlights a clear relationship between gender and education level. Lower education levels show more diversity across gender categories, while higher education levels are increasingly concentrated among males. This pattern suggests potential underlying social, demographic, or sampling effects and emphasizes the importance of considering categorical relationships when interpreting results from the dataset.

Predictive Modeling

Step 10 - Linear Regression and Correlation

Use the following chunk as a compass. Choose two numeric variables and run the following regression. Choose different variables than the ones presented below.

#############################################################
#
# Step 10 - Linear Regression and Scatterplots
# Choose two numeric variables and run the following regression.
# Do not use the following two variables
# The code is presented as an example
#
# We separate the numerical variables and review their relationships
# The numerical variables are in columns 3,4,5,7, 10, 15
#############################################################

temp_df <- df[c(3:5,7,10,15)] # we only select the numeric variables
pairs(temp_df) # this creates a scatterplot (pairs) matrix for visually inspecting correlations

Your Turn

Essay Question

Explain the Correlation Matrix and the heat map in detail, what relationships can you identify, what trends, why they are important? Can you tie this to your beliefs and understanding of similar data?

Answer

The correlation matrix in the scatterplot matrix shows the pairwise relationships among the numeric variables, including Age, Height, Weight, Income, Score, and Happiness.
Each panel represents the relationship between two variables, where individual points indicate observations. The direction, clustering, and spread of the points provide insight into the strength and nature of the correlations between variables.

From the graph, a clear positive relationship between Height and Weight is visible. The data points show an upward trend, indicating that individuals with greater height tend to have higher weight.
The clustering of points around a diagonal trend suggests a moderately strong positive correlation. This relationship is expected and confirms the dataset’s realistic physical characteristics.

The relationship between Income and Score also shows a slight positive trend, indicating a weak to moderate positive correlation. The scatterplot panels suggest that individuals with higher income levels tend to have somewhat higher performance scores. However, the data points are widely dispersed, which indicates that the relationship exists but is not strongly predictive. This suggests that additional factors beyond income likely influence performance outcomes.

The graph also shows a mild positive trend between Score and Happiness. The data show an upward trend, indicating that individuals with higher scores tend to report higher levels of happiness. The trend is not tightly clustered, suggesting that happiness is influenced by multiple variables rather than solely by performance measures.

On the other hand, when we compared Age with other variables, we did not observe any strong directional patterns. This indicates a weak correlation between Age and the other variables, including Income, Score, and Happiness. The lack of clustering or linear structure suggests that age does not strongly influence the outcomes within the dataset.

In sum, the scatterplot correlation matrix provides a visual confirmation of the relationships among the numeric variables. Stronger relationships appear as structured diagonal patterns, while weaker relationships appear as widely scattered data points. The patterns help identify which variables move together and which operate independently, making correlation an essential tool for guiding predictive modeling and understanding the dataset’s structure.
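To complement the visual inspection, the numeric correlation coefficients can be computed for the same variables (a sketch, assuming temp_df from Step 10 still holds only the numeric columns):

round(cor(temp_df, use = "complete.obs"), 2)   # pairwise Pearson correlations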

#############################################################
#
# Step 11 Run a regression model
# Make the necessary changes below to run your own regression
# Answer the questions in your paper
#
#############################################################


r <- lm(Income~Age, data=df) # it runs the least squares
r
## 
## Call:
## lm(formula = Income ~ Age, data = df)
## 
## Coefficients:
## (Intercept)          Age  
##       -6316         2020
summary(r) # Information about your variables, R^2 and p value are printed
## 
## Call:
## lm(formula = Income ~ Age, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22276.6  -3217.3    893.4   2743.2  18869.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -6316.5     4843.7  -1.304    0.193    
## Age           2019.8      169.9  11.889   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7463 on 498 degrees of freedom
## Multiple R-squared:  0.2211, Adjusted R-squared:  0.2195 
## F-statistic: 141.3 on 1 and 498 DF,  p-value: < 2.2e-16
# In your example you should change the title and the labels of the axis appropriately
# Change Colors

Your Turn

Essay Question

Explain the results that you receive. What does the R^2 mean in this example? What about the p-value? Why are they important? Can you tie this to your beliefs and understanding of similar data?

Answer

The linear regression model examines the relationship between the independent variable Age and the dependent variable Income.
The regression equation is:

\[\text{Income} = -6316 + 2020 \times \text{Age}\]

The equation indicates that when Age increases by one year, the predicted Income increases by approximately $2020, holding all else constant. The positive slope confirms a positive relationship between age and income, suggesting that as individuals grow older, their earnings tend to increase.
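A quick sketch that checks the fitted equation by predicting income for a hypothetical 30-year-old:

predict(r, newdata = data.frame(Age = 30))   # roughly -6316 + 2020 * 30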

The intercept (-6316) represents the predicted income when Age equals zero. While this value is not meaningful in a practical sense, since age zero is outside the relevant range of the data, the intercept is necessary for defining the regression line mathematically and does not affect the interpretation of the slope.

The p-value associated with the Age coefficient is less than \(2 \times 10^{-16}\), well below the standard significance threshold of 0.05. This indicates that the relationship between Age and Income is statistically significant, meaning the observed association is very unlikely to be due to chance. The large t-value (11.889) further supports the strength and reliability of this relationship. In contrast, the intercept is not statistically significant, which is not a concern given its limited substantive interpretation.

The \(R^2\) value of 0.2211 indicates that approximately 22.1% of the variability in Income is explained by Age alone. While this is a moderate level of explanatory power, it is reasonable for a single-predictor model involving economic outcomes, which are typically influenced by many factors. The adjusted \(R^2\) (0.2195) is nearly identical, suggesting that the model is stable and not overfitting the data.

The residual standard error (7,463) reflects the average deviation of observed income values from the model’s predictions. Given the wide range of income values, this level of residual variability is expected and highlights that income is shaped by additional variables such as education, occupation, industry, location, and individual skills.

These results align well with common understanding and real-world experience. Income generally increases with age as individuals gain experience and advance in their careers, but age alone does not fully determine earnings. The fact that nearly 78% of income variation remains unexplained reinforces the idea that income is multifactorial and influenced by a combination of personal, educational, and structural factors.

Overall, the p-value confirms that Age is a meaningful predictor of Income, while the \(R^2\) value quantifies the practical importance of this relationship. Together, these measures demonstrate that age plays a significant, but partial, role in explaining income differences, making this model informative but not exhaustive for understanding income dynamics.
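For reference, the quantities interpreted above can be extracted directly from the fitted object rather than read off the printed summary; a brief sketch using only base R:

s <- summary(r)                  # summary object for the Income ~ Age model
s$r.squared                      # multiple R-squared (0.2211 here)
s$adj.r.squared                  # adjusted R-squared
coef(s)["Age", "Pr(>|t|)"]       # p-value of the Age coefficient
confint(r, level = 0.95)         # 95% confidence intervals for the coefficients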

Visual representations

#############################################################
#
# Step 12
#
# Create a scatterplot and add the regression line.
#############################################################
plot(df$Age,df$Income, col = "blue", main = "Income vs Age", xlab = "Age", ylab= "Income")
# plot(df$Happiness, df$Score, col = "blue", main = "Score vs Happiness", xlab = "Happiness", ylab = "Score") # it plots the scatterplot
abline(reg=r, col = "red")          # it adds the regression line

Your Turn

Essay Question

How does the scatterplot provide more or a different perspective to the researcher? Please describe the plot, but also share your insights from this exploration.

Answer

The scatterplot of Income versus Age provides a visual perspective that complements and extends the understanding gained from the regression results and summary statistics. While numerical measures such as the regression coefficient, p-value, and \(R^2\) quantify the relationship, the scatterplot shows how individual observations are distributed, revealing patterns, variability, and potential anomalies that are not fully captured by statistics alone.

From the plot, there is a clear positive linear trend, reinforced by the upward-sloping regression line. As age increases, income generally tends to rise, which visually confirms the positive coefficient observed in the linear regression model. Individuals in their mid-to-late 20s tend to cluster around lower income levels, while those in their early-to-mid 30s are more frequently observed at higher income levels. This pattern supports the idea that income increases with age, likely reflecting greater work experience and career progression.

The scatterplot also highlights the spread of income values at similar ages. For example, individuals around age 28–30 show a wide range of incomes, from relatively low to quite high. This dispersion indicates that while age has a meaningful effect on income, it does not fully explain income differences. This visual observation aligns with the \(R^2\) value of approximately 0.22, showing that age explains only part of the variability in income, with other factors such as education, occupation, and industry also playing important roles.

Another insight gained from the scatterplot is the absence of extreme outliers that would disproportionately influence the regression line. Although some observations fall above or below the general trend, they remain within a reasonable range. This suggests that the regression results are not driven by a small number of unusual cases and that the relationship between age and income is fairly consistent across the sample.

Finally, the scatterplot helps assess whether a linear model is appropriate. The overall pattern shows no strong curvature or structural breaks, supporting the use of a linear regression model for this relationship. The roughly even distribution of points around the regression line indicates that a linear approximation provides a reasonable summary of the relationship.

Overall, the scatterplot offers a more intuitive and detailed understanding of how income changes with age. It visually confirms the statistical findings, illustrates the degree of variability at different ages, and reinforces the conclusion that age is a significant but incomplete predictor of income. This combination of visual and numerical analysis leads to a more nuanced and credible interpretation of the data.
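Because ggplot2 is already loaded in this template, an equivalent plot with a fitted line and confidence band can be drawn as a cross-check; a sketch:

# Scatterplot of Income vs Age with the least-squares line and 95% confidence band
ggplot(df, aes(x = Age, y = Income)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(title = "Income vs Age", x = "Age", y = "Income")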

Your Turn

Complete the steps shown above for two different numerical variables.

#############################################################
#
# Step 11 Run a regression model
# Make the necessary changes below to run your own regression
# Answer the questions in your paper
#
#############################################################

# CHANGE the variables Income and Age, but always choose numerical variables.
# Do not forget to set eval=TRUE for this and the following chunk before you knit.

r <- lm(Score ~ Happiness, data = df) # it runs the least squares
r
## 
## Call:
## lm(formula = Score ~ Happiness, data = df)
## 
## Coefficients:
## (Intercept)    Happiness  
##      2.2813       0.5879
summary(r) # Information about your variables, R^2 and p value are printed
## 
## Call:
## lm(formula = Score ~ Happiness, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4843 -0.2964 -0.1964  0.2278  1.5097 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.28126    0.21714   10.51   <2e-16 ***
## Happiness    0.58788    0.02896   20.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6275 on 498 degrees of freedom
## Multiple R-squared:  0.4527, Adjusted R-squared:  0.4516 
## F-statistic:   412 on 1 and 498 DF,  p-value: < 2.2e-16
# In your example you should change the title and the labels of the axis appropriately
# Change Colors

Your Regression Diagnostics

#############################################################
#
# Step 12
#
# Create a scatterplot and add the regression line.
#############################################################

plot(df$Happiness, df$Score, col = "blue", main = "Score vs Happiness", xlab = "Happiness", ylab = "Score") # it plots the scatterplot, with Happiness on the x-axis to match the fitted model
abline(reg=r, col = "red")          # it adds the regression line
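In addition to the scatterplot, base R can produce the standard diagnostic panels for the Score ~ Happiness model; a minimal sketch:

# Default lm diagnostics: residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(r)
par(mfrow = c(1, 1))   # reset the plotting layout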

Your Turn

Essay Question

Explain the results that you receive. What does the R^2 mean in this example? What about the p-value? Why are they important? Can you tie this to your beliefs and understanding of similar data? How does the scatterplot provide more or a different perspective to the researcher? Please describe the plot, but also share your insights from this exploration.

Answer

The linear regression model evaluates the relationship between Happiness (independent variable) and Score (dependent variable). The estimated regression equation is:

\[\text{Score} = 2.28 + 0.59 \times \text{Happiness}\]

This result indicates that for every one-unit increase in Happiness, the predicted Score increases by approximately 0.59 units, holding other factors constant. The positive slope confirms a meaningful positive association between happiness and performance, suggesting that individuals who report higher happiness levels tend to achieve higher scores.

The p-value associated with the Happiness coefficient is less than \(2 \times 10^{-16}\), which is far below the standard significance threshold of 0.05. This means the relationship between Happiness and Score is statistically significant, and the likelihood that this association occurred by random chance is extremely small. The large t-value (20.30) further reinforces the strength and reliability of this relationship. Statistical significance is important because it confirms that the observed effect is real and not simply due to sampling variability.

The \(R^2\) value of 0.4527 indicates that approximately 45.3% of the variability in Score is explained by Happiness alone. In the context of social and behavioral data, this represents a relatively strong explanatory power for a single predictor. The adjusted \(R^2\) (0.4516) is nearly identical, suggesting the model is stable and not overfitting. However, the remaining unexplained variance indicates that performance is influenced by multiple factors beyond happiness, including education, income, motivation, and personal circumstances.

The residual standard error (0.6275) reflects the average deviation between observed scores and the model’s predictions. Given the scale of the Score variable, this level of error is reasonable and expected in behavioral datasets, where outcomes naturally exhibit variability.

The scatterplot of Score versus Happiness provides a crucial visual complement to the regression results. While the regression statistics quantify the strength and significance of the relationship, the scatterplot shows how individual observations are distributed. The upward-sloping regression line visually confirms the positive relationship between happiness and score. As happiness increases from roughly 5.5 to 9, scores generally rise from around 6 to near 9, reinforcing the interpretation of the regression coefficient.

The scatterplot also reveals the spread of points around the regression line, especially at mid-range happiness values. This dispersion visually explains why \(R^2\) is not closer to 1: happiness explains a substantial portion of the variation in Score, but not all of it. Importantly, the plot shows no extreme outliers or influential points, indicating that the regression results are not driven by a few unusual observations. This strengthens confidence in the model’s validity.

Additionally, the scatterplot helps assess whether a linear model is appropriate. The absence of strong curvature or clustering patterns suggests that a linear relationship is a reasonable assumption for this data. This visual confirmation supports the use of linear regression as an appropriate analytical approach.

Overall, the regression results and scatterplot together provide a comprehensive understanding of the relationship between happiness and performance. The p-value confirms that the relationship is statistically significant; the \(R^2\) affirms its practical importance, and the scatterplot adds transparency by revealing variability, consistency, and model suitability. These findings align well with the common understanding that higher well-being is often associated with better performance, while also acknowledging that performance is influenced by multiple interconnected factors.
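A practical extension of this model is predicting scores at chosen happiness levels; a minimal sketch (the happiness values below are purely illustrative):

# Predicted Score, with prediction intervals, at illustrative Happiness values
new_vals <- data.frame(Happiness = c(6, 7, 8))
predict(r, newdata = new_vals, interval = "prediction")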

An example with four predictors (Age, Happiness, Education, and Gender) is shown below. You can choose your own variables or provide an explanation of the findings for this example.

#############################################################
#
# Step 13 - Optional
#
# In case you want to add more variables in your model the following 
# Example is provided.
#############################################################

r2 <- lm(Income~Age+Happiness+Education+Gender, data=df) # it runs the least squares
r2
## 
## Call:
## lm(formula = Income ~ Age + Happiness + Education + Gender, data = df)
## 
## Coefficients:
##          (Intercept)                   Age             Happiness  
##             104720.2                -957.9               -4762.2  
## EducationHigh School     EducationMaster's          EducationPhD  
##              -2908.6               16920.1               20229.8  
##           GenderMale           GenderOther  
##               6980.2                2299.9
summary(r2) # Information about your variables, R^2 and p value are printed
## 
## Call:
## lm(formula = Income ~ Age + Happiness + Education + Gender, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13718.6  -2232.8    187.3   2411.8  11156.4 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          104720.2     6397.2  16.370  < 2e-16 ***
## Age                    -957.9      180.6  -5.304 1.72e-07 ***
## Happiness             -4762.2      483.1  -9.857  < 2e-16 ***
## EducationHigh School  -2908.6      862.3  -3.373 0.000802 ***
## EducationMaster's     16920.1     1069.9  15.814  < 2e-16 ***
## EducationPhD          20229.8     1259.4  16.063  < 2e-16 ***
## GenderMale             6980.2      977.3   7.142 3.32e-12 ***
## GenderOther            2299.9      912.1   2.522 0.011997 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4609 on 492 degrees of freedom
## Multiple R-squared:  0.7064, Adjusted R-squared:  0.7022 
## F-statistic: 169.1 on 7 and 492 DF,  p-value: < 2.2e-16
# In your example you should change the title and the labels of the axis appropriately
# Change Colors
# Explain the outcome.
# Run your model
r2 <- lm(Income ~ Age + Happiness + Education + Gender, data = df)

# Set up plotting space: 2 rows, 2 columns
par(mfrow = c(2, 2))

# 1. Residuals vs Fitted plot
# Checks for non-linearity, unequal error variance
plot(r2, which = 1)

# 2. Normal Q-Q plot
# Checks if residuals are normally distributed
plot(r2, which = 2)

# 3. Scale-Location plot
# Checks homoscedasticity (constant variance)
plot(r2, which = 3)

# 4. Residuals vs Leverage plot
# Finds influential observations
plot(r2, which = 5)

# Reset plotting space to normal
par(mfrow = c(1,1))

# --- Additional Useful Diagnostics ---

# 5. Histogram of residuals
hist(r2$residuals,
     main = "Histogram of Residuals",
     xlab = "Residuals",
     col = "lightblue",
     border = "white")

# 6. Residuals vs each predictor
# Helps to spot non-linear patterns individually
par(mfrow = c(1, 4))
plot(df$Age, r2$residuals, main = "Residuals vs Age", xlab = "Age", ylab = "Residuals")
abline(h = 0, col = "red")
plot(df$Happiness, r2$residuals, main = "Residuals vs Happiness", xlab = "Happiness", ylab = "Residuals")
abline(h = 0, col = "red")
plot(df$Education, r2$residuals, main = "Residuals vs Education", xlab = "Education", ylab = "Residuals")
abline(h = 0, col = "red")
plot(df$Gender, r2$residuals, main = "Residuals vs Gender", xlab = "Gender", ylab = "Residuals")
abline(h = 0, col = "red")

par(mfrow = c(1, 1))

# 7. Cook's Distance
# Identifies influential points
cooksd <- cooks.distance(r2)
plot(cooksd, type = "h", main = "Cook's Distance", ylab = "Cook's Distance")
abline(h = 4/length(cooksd), col = "red", lty = 2)  # common threshold
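If any bars rise above the dashed 4/n threshold, the flagged rows can be listed for closer inspection; a short sketch:

# Row indices whose Cook's distance exceeds the 4/n rule of thumb
influential <- which(cooksd > 4 / length(cooksd))
influential
df[influential, ]   # review the flagged observations, if any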

This multiple linear regression model examines how Age, Happiness, Education, and Gender jointly explain variation in Income. In addition to the regression coefficients, a full set of diagnostic plots was reviewed to assess model assumptions, fit, and potential issues.

The fitted model explains a substantial portion of income variability (Multiple \(R^2\) = 0.7064, Adjusted \(R^2\) = 0.7022).

This means that approximately 70% of the variation in Income is explained by the included predictors, which is a strong result for socioeconomic data. The high F-statistic (169.1) and a p-value of less than \(2 \times 10^{-16}\) confirm that the model as a whole is statistically significant.
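To verify that the additional predictors genuinely improve on the earlier single-predictor model, the two fits can be compared with a nested-model F-test; a sketch that refits the simple model so both objects exist side by side:

# Nested model comparison: Income ~ Age versus the full model r2
r_simple <- lm(Income ~ Age, data = df)
anova(r_simple, r2)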

Key Coefficients

Age (−957.9, p < 0.001): Holding other variables constant, income decreases by about $958 per additional year of age. This negative effect contrasts with the earlier simple regression of Income on Age and suggests that once happiness, education, and gender are controlled for, age may reflect late-career plateauing or cohort effects rather than growth.

Happiness (−4,762.2, p < 0.001): Higher happiness is associated with lower income after accounting for other factors. While counterintuitive at first glance, this may reflect trade-offs between income and work-life balance or differences in career priorities.

Education shows the strongest and most consistent effects:

High School: −$2,909 (p < 0.001)

Master’s: +$16,920 (p < 0.001)

PhD: +$20,230 (p < 0.001)

These results align with economic theory and prior research, confirming that higher educational attainment substantially increases earning potential.

Gender:

Male: +$6,980 (p < 0.001)

Other: +$2,300 (p < 0.05)

These coefficients indicate income differences across gender categories after controlling for education, age, and happiness.
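Each Education and Gender estimate above is measured against the omitted reference level of its factor (most likely Bachelor's and Female, given the dummy names printed in the summary). The baseline can be checked, and changed for illustration, as in the sketch below; the copy keeps the original data frame untouched:

# The first level of each factor is the reference category in the model
levels(df$Education)
levels(df$Gender)

# Illustrative releveling: express education effects against PhD instead
df_alt <- df
df_alt$Education <- relevel(df_alt$Education, ref = "PhD")
r2_alt <- lm(Income ~ Age + Happiness + Education + Gender, data = df_alt)
summary(r2_alt)   # same overall fit, coefficients expressed against the new baseline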

The residual standard error (4,609) suggests that predictions are reasonably accurate given the scale of income values.

Diagnostic Plot Analysis

1. Residuals vs Fitted: The residuals are centered around zero with no strong systematic pattern. This indicates that the linearity assumption is largely satisfied, though slight curvature suggests minor nonlinear effects that could be explored in future models.

2. Q–Q Plot: The Q–Q plot shows residuals closely following the theoretical normal line, with small deviations at the tails. This suggests that the normality assumption is reasonably met, which supports the validity of hypothesis tests and confidence intervals.

3. Scale–Location Plot: The relatively flat trend line indicates homoscedasticity, meaning the variance of residuals is fairly constant across fitted values. This strengthens confidence in the model’s standard errors and inference.

4. Residuals vs Leverage: No observations exceed Cook’s distance thresholds, indicating no influential points dominate the model. This suggests the results are stable and not driven by a small number of extreme cases.

5. Histogram of Residuals: The residuals are approximately symmetric and centered around zero, reinforcing the conclusion that the model errors behave as expected.

6. Residuals vs Predictors: Residuals vs Age and Happiness show no strong remaining structure, indicating these variables are appropriately modeled.

Overall Insights

Overall, this analysis demonstrates a well-specified and statistically sound predictive model, suitable for interpretation and further refinement.
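As an optional complement to these visual checks, residual normality can also be tested formally with base R; a brief sketch (with n = 500 the Shapiro-Wilk test applies, though in large samples even minor departures can yield small p-values):

# Formal normality check of the residuals from the multiple regression model
shapiro.test(residuals(r2))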